Preface


It was our privilege to serve as the program chairs for CAV 2021, the 33rd International 
Conference on Computer-Aided Verification. CAV 2021 was held as a virtual con- 
ference during July 20-23, 2021. The tutorial days were on July 19 and July 24, 2021, 
and the pre-conference workshops were held during July 18-19, 2021. Due to the 
COVID-19 outbreak, all events took place online. 

CAV is an annual conference dedicated to the advancement of the theory and 
practice of computer-aided formal analysis methods for hardware and software sys- 
tems. The primary focus of CAV is to extend the frontiers of verification techniques by 
expanding to new domains such as security, quantum computing, and machine 
learning. This puts CAV at the cutting edge of formal methods research, and this year’s 
program is a reflection of this commitment. 

CAV 2021 received a very high number of submissions (290). We accepted 16 tool 
papers, 3 case studies, and 60 regular papers, which amounts to an acceptance rate of 
roughly 27%. The accepted papers cover a wide spectrum of topics, from theoretical 
results to applications of formal methods. These papers apply or extend formal methods 
to a wide range of domains such as concurrency, machine learning, and industrially 
deployed systems. The program featured keynote talks by Loris D’Antoni 
(UW-Madison), Corina Pasareanu (NASA), and Anna Slobodova (Centaur Technol- 
ogy, Inc.) as well as invited tutorials by Nate Foster (Cornell University), Zak Kincaid 
(Princeton) together with Tom Reps (UW-Madison), and Nadia Polikarpova (UC San 
Diego). Furthermore, we continued the tradition of Logic Lounge, a series of discus- 
sions on computer science topics targeting a general audience. 

In addition to the main conference, CAV 2021 hosted the following workshops: 
Formal Approaches to Certifying Compliance (FACC), Formal Methods for 
ML-Enabled Autonomous Systems (FoMLAS), Formal Methods for Blockchains 
(FMBC), Numerical Software Verification (NSV), Theory and Practice of String 
Solving (TPSS), Verifying Probabilistic Programs (VeriProP), Synthesis (SYNT), 
Satisfiability Modulo Theories (SMT), and Verification Mentoring Workshop (VMW). 

Organizing a flagship conference like CAV requires a great deal of effort from the 
community. The Program Committee for CAV 2021 consisted of 79 members — a 
committee of this size ensures that each member has to review only a reasonable 
number of papers in the allotted time. In all, the committee members wrote over 900 
reviews while investing significant effort to maintain and ensure the high quality of the 
conference program. We are grateful to the CAV 2021 Program Committee for their 
outstanding efforts in evaluating the submissions and making sure that each paper got a 
fair chance. Like last year’s CAV, we made the artifact evaluation mandatory for tool 
paper submissions and optional, but encouraged, for the rest of the accepted papers. 
This year saw an unprecedented number of 66 artifact submissions. The Artifact 
Evaluation Committee consisted of 72 members who put in significant effort to eval- 
uate each artifact. The goal of this process was to provide constructive feedback to tool 
developers and help make the research published in CAV more reproducible. We are
also very grateful to the Artifact Evaluation Committee for their hard work and ded- 
ication in evaluating the submitted artifacts. 

CAV 2021 would not have been possible without the tremendous help we received 
from several individuals, and we would like to thank everyone who helped make CAV 
2021 a success. First, we would like to thank Clément Pit-Claudel and Maria Schett for 
chairing the Artifact Evaluation Committee and John Cyphert for putting together the 
proceedings. We also thank Arie Gurfinkel for chairing the workshop organization, 
Bor-Yuh Evan Chang for managing sponsorship, Thomas Wies for arranging student 
fellowships, Norine Coenen for handling publicity, Leopold Haller for organising the 
Logic Lounge, and Peter Müller for putting together the Ask me Anything program. We
also thank Jean-Baptiste Jeannin and Arjun Radhakrishna for chairing the Mentoring 
Committee. Putting together an online conference is a complex task and we are grateful 
to the virtualization chair Tiago Ferreira, the student volunteer coordinators Tobias 
Kappé and Tao Gu, the local organizers for the Asia timezone, Ichiro Hasuo and 
Krishna S, and the team at Slides Live for all their efforts. Last but not least, we would 
like to thank the members of the CAV Steering Committee (Kenneth McMillan, Aarti 
Gupta, Orna Grumberg, and Daniel Kroening) for helping us with several important 
aspects of organizing CAV 2021. 

We hope that you will find the proceedings of CAV 2021 scientifically interesting 
and thought-provoking! 
Abstract. We present NNREPAIR, a constraint-based technique for
repairing neural network classifiers. The technique aims to fix the logic 
of the network at an intermediate layer or at the last layer. NNREPAIR 
first uses fault localization to find potentially faulty network parameters 
(such as the weights) and then performs repair using constraint solving 
to apply small modifications to the parameters to remedy the defects. 
We present novel strategies to enable precise yet efficient repair such 
as inferring correctness specifications to act as oracles for intermediate 
layer repair, and generation of experts for each class. We demonstrate 
the technique in the context of three different scenarios: (1) Improv- 
ing the overall accuracy of a model, (2) Fixing security vulnerabilities 
caused by poisoning of training data and (3) Improving the robustness of 
the network against adversarial attacks. Our evaluation on MNIST and 
CIFAR-10 models shows that NNREPAIR can improve the accuracy by 
45.56% points on poisoned data and 10.40% points on adversarial data. 
NNREPAIR also provides a small improvement in the overall accuracy of
models, without requiring new data or re-training. 


1 Introduction 


Neural networks have many applications, being used for example in pattern 
analysis, image classification, or sentiment analysis for textual data, and also in 
medical diagnosis or perception and control in autonomous driving, which bring 
safety and security concerns [10]. These systems learn the network parameters 
(weights and biases) through training on a set of labeled examples. The per- 
formance of the trained networks is independently validated by computing the 
accuracy on a held-out labeled test set. 

Just like other software systems, trained neural networks can have defects 
that need repair. For example, a trained neural network may have low accuracy,
which may be due to limited training data. One would like to repair the network
by modifying its parameters (or a subset of them) to improve its overall accu- 
racy, even in the absence of additional training data. In another scenario, the 
training data for a neural network has been poisoned by an adversary leading to 
high accuracy on normal data but poor accuracy on poisoned data [6,7,11]. In 
this case, one would like to repair the network to remedy the defect while still 
maintaining a high accuracy on non-poisoned data. In yet another scenario, a 
trained network may have high accuracy on the test set but may be vulnerable 
to adversarial perturbations, i.e., small modifications to the inputs that lead to 
unexpected outputs. Recent studies [8,15,20] show that this defect is very com- 
mon even for highly trained, highly accurate networks. In this case, one would 
like to repair the network to make it robust against adversarial perturbations 
while at the same time retaining its accuracy on the normal, unperturbed test 
set. 

Retraining could be used to alter the neural network parameters and repair
faults, but it can be difficult, expensive, and subject to uncertainties,
and it may result in a network that is quite different from the original one, thus
wasting the effort of the original training.

We present a novel constraint-solving based approach, NNREPAIR, to repair 
neural networks trained for the task of classification, with respect to all three
scenarios described above. Similar to traditional program repair [5,13,22], NNREPAIR
first uses fault localization to identify the network parameters that are the
likely source of defects, followed by repair, which uses constraint solving to apply 
small modifications to the network parameters to remedy the defects. 

Given a trained neural network model, the potentially faulty components 
could be the architecture of the model (which is fixed in the design stage) or the 
learnable parameters such as the weights and the biases (which are determined
during training). In this work, we focus on the learnable parameters of a neural
network model, specifically the weights on the edges connecting neurons. As 
observed in [9], changing the weights is a common fix for neural networks. 

We leverage the organization of a neural network into layers and the natural 
decomposition of computation that each layer provides, and scope our work 
to focus the repair on a single layer of the network. Repairs across multiple 
layers are possible, but they would be less scalable and involve more complex 
modifications. We propose two types of repairs: intermediate-layer repair and 
last-layer repair. Intermediate-layer repair attempts to fix failures by modifying 
the behavior of neurons at an inner layer of the network. Last-layer repair, on 
the other hand, attempts to modify the decision constraints at the last layer. 

Fault localization is used to mark one or more neurons at a layer as ‘suspicious’
and to find a subset of incoming edges to the suspicious neurons, whose
weights will be the target for repair. The repair process involves solving con- 
straints collected from the network, via a simple form of concolic execution [17]. 
For last-layer repair, the oracle of the repair is the desired label for every fail- 
ing input and the repair constraints encode this decision. For intermediate-layer 
repair, we propose a novel use of activation patterns representing specifications
of correct behavior at the layer [4] as oracles for repair. This enables us to keep
the repair local to the layer and therefore efficient. 

Furthermore, to make the constraint solving scalable, instead of solving for 
constraints for all classes at once, we propose to decompose the repair into a 
set of sub-tasks, one for each output class. Specifically, we set up the constraint
solving to correct a subset of the weights with the goal of improving the accuracy of
the model with respect to a specific output class. The result of this repair is a set
of experts, which are neural networks that improve the accuracy of the network with
respect to specific output classes. We then combine the experts to obtain the final
repaired model.

There are a few recent related techniques that propose to use constraint solv- 
ing for neural network repair. We summarize them in Sect. 6. These techniques 
tend to focus on last layer repair while we also propose repair at an interme- 
diate layer. Furthermore, we evaluate our initial prototype in three scenarios: 
improving accuracy, robustness, and resilience towards poisoned data. None of
the related techniques addresses all three, although they could potentially be
extended to do so.

We summarize our contributions as follows. 


— We propose and implement a repair technique that applies fault localization 
and constraint solving to neural networks. Our approach can perform both 
last and intermediate layer repair. 

— To achieve scalability, our approach decomposes the repair into a set of experts 
which display superior accuracy for specific labels. These are then combined 
using a set of strategies that we propose to obtain the final repair. 

— We present a novel technique to make it more efficient to repair inner layers of 
a neural network by inferring specifications of correct behavior (in terms
of the activation patterns) at the output of inner layers, and using them as
oracles for repair. 

— While previous neural network repair techniques (see Sect.6) tend to focus 
solely on improving accuracy, we demonstrate our technique in the context of 
three different scenarios: (1) Improving the overall accuracy of a model, (2) 
Fixing security vulnerabilities caused by poisoning of training data and (3) 
Improving the robustness of the network against adversarial attacks. 

— We evaluate the techniques in the context of image classifiers for the MNIST 
and CIFAR-10 data sets. The results indicate that NNREPAIR can improve 
the performance of the network by 45.56% points on poisoned data and 
10.40% points on adversarial data. NNREPAIR also provides a small improvement
(+0.20% points) in the overall accuracy of models, without requiring
new data or re-training. 


2 Background 


Neural Networks. In this work we focus on neural network classifiers. These 
networks take in an input, such as an image, and output a class (or label) 
specific to the problem they have been trained to solve. Networks are organized 
in layers of different types, including convolutional, activation, and pooling, each 
of which has a number of nodes. For this paper, we focus on activation layers.
Each node from the previous layer will output into the associated node in the
activation layer, which will apply an activation function. Common activation 
functions include linear rectification (a.k.a. ReLU) and sigmoid. For simplicity 
we discuss here ReLU activations but our work applies to arbitrary activations 
as discussed below. Let N(X) denote the value of a neuron as a function of the
input. $N(X) = \sum_i w_i \cdot N_i(X) + b$, where the $N_i$'s denote the values of the
neurons in the previous layer of the network and the coefficients $w_i$ and the
constant b are referred to as weights and bias, respectively. If this function evaluates to a
non-negative value, the node is activated and outputs that value, otherwise it 
outputs 0. A final decision (logits) layer produces the network decisions based on 
the real values computed by the network, by applying e.g., a softmax function; 
in our work we use the max function instead. For a comprehensive introduction 
to neural networks, see [3]. 
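
The neuron computation and the max-based decision described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the numeric values are hypothetical and chosen only to show the mechanics.

```python
def relu(v):
    """ReLU activation: pass positive values through, output 0 otherwise."""
    return v if v > 0 else 0.0


def neuron_value(weights, prev_values, bias=0.0):
    """Weighted sum of the previous layer's values plus bias, then ReLU."""
    pre_activation = sum(w * n for w, n in zip(weights, prev_values)) + bias
    return relu(pre_activation)


def decide(logits):
    """Decision layer using max: index of the largest output value."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical previous-layer values and weights, for illustration only.
prev = [3.0, 1.0]
hidden = [
    neuron_value([1.0, -2.0], prev),            # 3 - 2 = 1, neuron is on
    neuron_value([0.5, 0.5], prev, bias=-4.0),  # 2 - 4 = -2, neuron is off
]
label = decide([0.2, 1.7, 0.9])
```

Here `hidden` holds one active and one inactive neuron, and `decide` picks the class whose output value is largest.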


Activation Patterns. We leverage previous work [4] to infer network properties 
based on the activation patterns of neurons in the network. We will use these 
activation patterns as oracles for the intermediate layer repair. An activation 
pattern σ specifies an activation status (on or off) for some subset of neurons at
a layer in the network. All other neurons do not matter. We write on(σ) for the
set of neurons marked on, and off(σ) for the set of neurons marked off in the
pattern σ. Each activation pattern σ defines a predicate σ(X) that is satisfied by
all inputs X whose evaluation achieves the same activation status for all neurons
as prescribed by the pattern.


$$\sigma(X) := \bigwedge_{N \in on(\sigma)} N(X) > 0 \;\wedge \bigwedge_{N \in off(\sigma)} N(X) \le 0 \qquad (1)$$


A decision pattern σ is a property with respect to network F and postcondition P if:

$$\forall X : \sigma(X) \Rightarrow P(F(X)) \qquad (2)$$

A postcondition for a classification network is that the top predicted class is
C, i.e., $P(Y) := \mathrm{argmax}(Y) = C$.

The previous work [4] also describes how to compute activation patterns. 
The idea is to observe the activation signatures of a large number of inputs and 
apply decision tree learning over them to infer activation patterns that are thus 
empirically valid. We adopt the same approach here. The support of a pattern is 
formed by all the inputs that satisfy the pattern. We are interested in computing 
high-support patterns as they are the most likely to reflect valid properties of 
the network. 
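
As a rough sketch of how a pattern and its support could be checked, assume a pattern is stored as a mapping from neuron index to on/off, with unlisted neurons treated as don't-cares and "off" read as "not strictly positive". The names `satisfies` and `support` and the sample values are hypothetical; the decision-tree learning step of [4] is not shown.

```python
def satisfies(pattern, neuron_values):
    """True if every neuron listed in the pattern has the prescribed status.

    pattern: dict mapping neuron index -> True (on) or False (off).
    neuron_values: values of all neurons at the layer for one input.
    """
    return all((neuron_values[i] > 0) == is_on for i, is_on in pattern.items())


def support(pattern, values_per_input):
    """Indices of the inputs whose layer values satisfy the pattern."""
    return [k for k, vals in enumerate(values_per_input) if satisfies(pattern, vals)]


# Hypothetical layer values for four inputs (two neurons at the layer).
layer_values = [[0.0, 4.0], [2.5, 2.0], [0.0, 1.5], [5.25, 0.0]]
pattern = {0: False, 1: True}  # first neuron off, second neuron on
supporting_inputs = support(pattern, layer_values)
```

With these sample values, the first and third inputs form the support of the pattern.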


3 Example 


This section demonstrates intermediate-layer and last-layer repair on a simple
example. Figure 1 shows a simple two-input network with two hidden layers, each
containing two ReLU nodes (ReLU(x) = x (on) if x > 0, 0 (off) otherwise), and
two outputs, y0 and y1. The weights are depicted on the edges between nodes.
For simplicity we assume biases are 0. The input X, which is a two-element
array denoted [x0, x1], is assigned class 0 if y0 > y1 and 1 otherwise. Let us
assume the model behaves correctly on the first four inputs shown in Table 1. The
table also shows the decisions of the ReLU activations for nodes N0, N1, N2, N3,
respectively. Whenever a ReLU node is on, the decision is indicated as a 1 and
if it is off, then the decision value is shown as 0.

[Figure 1: example network with edge weights; annotations: 1st repair: -1.9, 2nd repair: 1.6]

Fig. 1. Example


Table 1. Data for example

              | x0  | x1 | N0 | N1 | N2 | N3 | y0    | y1    | class | ideal
X0            | 1   | 1  | 1  | 1  | 0  | 1  | 8     | 6     | 0     | 0
X1            | 0   | 1  | 1  | 1  | 1  | 1  | 0.25  | 9.25  | 1     | 1
X2            | 1   | 0  | 0  | 1  | 0  | 1  | 3     | 2.25  | 0     | 0
X3            | -1  | 1  | 1  | 1  | 1  | 0  | -7.87 | 13.12 | 1     | 1
X4            | 1.5 | 2  | 1  | 1  | 1  | 1  | 12.68 | 12.68 | 1     | 0
after repair: | 1.5 | 2  | 1  | 1  | 0  | 1  | 13.3  | 10.5  | 0     | 0
X5            | 0.6 | 1  | 1  | 1  | 1  | 1  | 5.91  | 5.62  | 0     | 1
after repair: | 0.6 | 1  | 1  | 1  | 1  | 1  | 5.91  | 5.95  | 1     | 1

Consider now the input X4 = [1.5,2.0]. Assume this input is mis-classified; 
the output class is 1 but the ideal class is 0. The inaccuracy of the model could 
be a result of insufficient training. We then aim to build a repair, which in our 
case focuses on a single layer of the network and modifies the weights feeding 
into the neurons at that layer. 

We keep the repair local to the layer by using activation patterns [4] in 
lieu of the decision constraints. The insight in [4] is that the logic that every 
layer implements could be captured as rules in terms of the activation patterns 
of the neurons. We can observe in the example, that for all inputs correctly 
classified with label 0, the neuron pair (N2, N3) in the second layer has the 
activation pattern (off, on). For the failing input, this pattern is not satisfied; 
in fact the activation for (N2, N3) for the failing input is (on, on). We use the 
above observation to fix the failure by performing intermediate layer repair. We
aim to modify the neuron activations of the second layer on the failing input to
satisfy the correct-label pattern for class 0 at the layer.

Fig. 2. Overview of the approach

We aim to perform the repair by making minimal changes to the model. We 
identify the weights to be modified using an attribution-based approach and 
use constraint solving to compute the values of the new weights (see Sect. 4 for 
details). Changing the weight of a single edge, connecting N1 and N2, from -1.5
to -1.9 changes the activation pattern for (N2, N3) to (off, on) on the failing
input, while preserving the behavior of the neurons (in terms of their activation 
pattern) and the output of the model on the passing inputs. 
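
The effect of this single-weight change can be checked with a small forward pass. The sketch below is an illustration under assumptions: the edge weights are read off the repair constraints given in Sect. 4 rather than taken from Fig. 1, and the function and parameter names are hypothetical.

```python
def relu(v):
    return max(v, 0.0)


def forward(x, w_n1_n2=-1.5, w_n3_y1=1.5):
    """Run the example network; the two keyword weights are the repair targets."""
    x0, x1 = x
    n0 = relu(-1.0 * x0 + 2.0 * x1)
    n1 = relu(0.5 * x0 + 1.0 * x1)
    n2 = relu(2.0 * n0 + w_n1_n2 * n1)
    n3 = relu(-0.5 * n0 + 3.0 * n1)
    y0 = -1.5 * n2 + 2.0 * n3
    y1 = 2.5 * n2 + w_n3_y1 * n3
    label = 0 if y0 > y1 else 1
    pattern = (n2 > 0, n3 > 0)  # activation status of (N2, N3)
    return pattern, (y0, y1), label


passing = [(1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (-1.0, 1.0)]  # X0..X3
failing = (1.5, 2.0)                                          # X4

before = forward(failing)               # pattern (on, on), class 1
after = forward(failing, w_n1_n2=-1.9)  # pattern (off, on), class 0
labels_after = [forward(x, w_n1_n2=-1.9)[2] for x in passing]
```

Under these assumed weights, the failing input moves from class 1 to class 0 after the change, while the four passing inputs keep their labels.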

Consider now another input for the above-corrected network, X5 = [0.6, 1.0].
This input is very close to X1 = [0.0, 1.0] (correctly classified to 1) with a small
change to x0 that makes the model mis-classify the input to 0. This represents a
typical adversarial scenario where a correctly classified input is perturbed slightly
to create an input that ‘jumps’ the decision boundary of the network leading
to a mis-classification. It can be observed that the activation patterns of the
internal layer neurons for X5 are the same as for the correctly classified input
X1, thus an intermediate-layer repair would not work for this input. Therefore
we perform last-layer repair. We localize the weights of the edges in the last layer
that need repair. Changing the weight on the edge between N3 and y1 (from 1.5
to 1.6) corrects the class for the failing test to 1, while retaining the same labels
for the other inputs.


4 Approach 


Figure 2 gives an overview of our approach. We aim to repair a faulty trained 
neural network classifier, which is given as input. As in other repair approaches, 
we consider both positive and negative examples for the repair. The negative 
examples are used to guide the repair towards correcting the faults while the 
positive examples are used to constrain the repair to not damage the existing 
good functionality of the network. We aim for a repair strategy that is scalable 
and applies small changes to the network. We therefore target the repair on 
a single layer of the network. Repairs across multiple layers of the network are 
possible, but they would be less scalable and involve more complex modifications. 
Unlike previous work, which tends to focus the repair at the last layer
(see Sect. 6), we propose here techniques for both intermediate and last layer
repair. Intuitively, a last layer repair is easier as it aims to modify the weights 
that impact directly the decisions, and can use the network’s output as an oracle 
to guide the repair. However the resulting repair may not generalize well and 
furthermore the network may be faulty at some intermediate layer. A repair 
at an intermediate layer can have a higher impact over the network’s behavior 
but it is more difficult as it is not clear what oracle to use to guide the repair. 
One can use the output of the network as the oracle but this may result in an
unmanageably large number of constraints to solve. In this work we propose a
novel use of neuron activation patterns to act as oracles in intermediate layer 
repair. 

As repairing for all the output classes at the same time can be very difficult, 
our proposed approach obtains instead a set of expert networks, one for each 
target class, which are easier to compute. These experts are combined to obtain 
a final repaired classifier. Specifically, our repair strategy has the following steps: 


1 Fault Localization: The goal of this step is to identify a small set of suspicious 
neurons and incoming suspicious edges, whose weights we aim to correct. 

2 Concolic Execution: For the weights of the suspicious edges, we add δ values
that are set to 0 in concrete mode, but are designated as symbolic for the
symbolic mode. The network is executed concolically along positive and nega- 
tive examples, to collect the values of suspicious neurons in terms of symbolic 
expressions. 

3 Constraint Solving: The symbolic expressions are assembled with a set of 
repair constraints which are then solved with an off-the-shelf solver. Essen- 
tially, the repair constraints need to encode the network decision for the 
positive examples and modify (i.e., correct) the network decision for the neg- 
ative examples. For the last layer repair this amounts to adding constraints 
imposed by the decision layer. For intermediate-layer repair, we use activation 
patterns instead of decision constraints, allowing us to keep the repair local 
to the layer. 

The solutions for the symbolic δ’s obtained from the solver are used to update
the weights of the network, thus obtaining an expert for a specific class. 

4 Combining Experts: Finally the experts obtained for each class are combined 
to obtain the repaired classifier. This needs to be done carefully, to avoid 
redundant computations among experts and to not damage the overall accu- 
racy and timing performance of the classification. 


In the following we give more details about our approach. 


4.1 Intermediate-Layer Repair 


Fault Localization. We explore the usage of activation patterns of the network 
(Sect. 2) to act as oracles of correct behavior. We also use these patterns to guide 
the identification of potentially faulty neurons. Specifically, we use the decision- 
tree learning approach from [4] to extract correct-label patterns corresponding to 


10 M. Usman et al. 


every output class at an intermediate layer. Each pattern is satisfied by a group 
of inputs correctly classified to a certain label. Typically multiple correct-label 
patterns are generated. We select the ones with the highest support, which are
most likely to hold true on the network for all inputs. Note also that the work
in [4] considers ReLU activations but it could be extended to consider arbitrary 
linear or non-linear activation functions, by comparing the values of neurons 
with a threshold. 

A correct-label pattern with high support at a layer indicates that there 
is a high chance that any input satisfying the pattern at the layer would be 
classified by the network to the corresponding label. Furthermore, a mis-classified 
input will not satisfy the correct-label pattern for the respective ideal label. For 
every failing input, we compare the activations of the neurons with those in the 
respective correct-label pattern and consider those neurons whose activations 
differ as the potentially faulty ones. The repair then aims to change the outputs 
of the neurons for each of the failing inputs, such that they satisfy the correct- 
label pattern for their ideal labels. 

In this work, we select a dense layer (i.e., a fully connected layer which 
receives input from every neuron in the previous layer) with ReLU activations. 
Typically such dense layers appear closer to the output and may impact the 
classification decision more than convolutional layers which process the input. 
Further, the number of neurons at fully connected layers is typically smaller than 
at other layers making the pattern-extraction process efficient. 

Consider a mis-classified input $X_F$ with ideal label C. Let $\sigma_C$ be the
correct-label pattern with highest support for C. Let L be the layer for this pattern, and
let N denote a neuron at layer L. Then the set of suspicious or faulty neurons
$N_{faulty}$ can be defined as follows:


$$N \in N_{faulty} \iff (N \in on(\sigma_C) \wedge N(X_F) \le 0) \vee (N \in off(\sigma_C) \wedge N(X_F) > 0) \qquad (3)$$
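
The definition above can be read as a simple comparison between the pattern and the neuron values on the failing input. The following sketch assumes a dictionary representation for both; the function name and the sample neuron values (taken to be those of the failing input in the running example, computed from assumed weights) are hypothetical.

```python
def faulty_neurons(pattern, failing_values):
    """Neurons whose activation on the failing input contradicts the pattern.

    pattern: dict neuron name -> "on" or "off" (the correct-label pattern).
    failing_values: dict neuron name -> value of that neuron on the failing input.
    """
    faulty = []
    for name, status in pattern.items():
        value = failing_values[name]
        if (status == "on" and value <= 0) or (status == "off" and value > 0):
            faulty.append(name)
    return faulty


sigma_c = {"N2": "off", "N3": "on"}     # correct-label pattern for class 0
values_x4 = {"N2": 0.875, "N3": 7.0}    # assumed neuron values on the failing input
suspicious = faulty_neurons(sigma_c, values_x4)
```

Only N2 is flagged here, since N3 already agrees with the pattern.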


Once the neurons whose outputs need to change are identified, we also need to 
identify the incoming edges to those neurons whose weights we aim to modify. We 
use a simple statistical method to identify the important weights which impact 
the respective neuron’s output, more for the failing inputs as compared to the 
passing inputs. 

Consider a set Fail of failing inputs with the same ideal label C and a set 
Pass of passing inputs. We use #(-) to denote the cardinality of the sets. The 
defect score for each edge is determined as follows. 


$$Score(E_i) := \frac{\sum_{X \in Fail} |N_i(X) \cdot w_i|}{\#Fail} - \frac{\sum_{X \in Pass} |N_i(X) \cdot w_i|}{\#Pass} \qquad (4)$$


Here $E_i$ denotes an incoming edge (for a faulty node N), $N_i$ is the
corresponding node in the preceding layer and $w_i$ is the weight of the edge.

Thus, we take the average of the absolute values passing through the edge for
all the negative examples for C and the average of the absolute values passing
through the edge for all the positive examples and subtract them. The intuition
is to identify the edges which have more influence on the incorrect decision of the
network. We calculate the defect score for each incoming edge to each neuron
(N) in $N_{faulty}$. We then select the edges with top n% of the scores to create the
set of faulty edges, for a small n.
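
A minimal sketch of the defect score and the top-n% selection follows. The helper names are hypothetical, and the neuron values are those of N0 and N1 on the running example, computed from edge weights that are assumed from the constraints in this section rather than read from Fig. 1.

```python
def defect_score(weight, fail_values, pass_values):
    """Average |N_i(X) * w_i| over failing inputs minus the same over passing inputs."""
    fail_avg = sum(abs(v * weight) for v in fail_values) / len(fail_values)
    pass_avg = sum(abs(v * weight) for v in pass_values) / len(pass_values)
    return fail_avg - pass_avg


def top_edges(scores, percent):
    """Edges whose score ranks in the top `percent` percent (at least one edge)."""
    k = max(1, round(len(scores) * percent / 100))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Values of N0 and N1 on the failing input X4 and on the passing inputs X0..X3.
n0_fail, n0_pass = [2.5], [1.0, 2.0, 0.0, 3.0]
n1_fail, n1_pass = [2.75], [1.5, 1.0, 0.5, 0.5]

scores = {
    "N0->N2": defect_score(2.0, n0_fail, n0_pass),
    "N1->N2": defect_score(-1.5, n1_fail, n1_pass),
}
selected = top_edges(scores, 50)
```

With these values the edge from N1 to N2 receives the higher score and is the one selected.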


Concolic Execution. We perform a simplified form of concolic execution to form 
symbolic constraints for suspicious neurons. For the weights of the suspicious 
edges, we add δ values that are set to 0 in concrete mode, but are designated
as symbolic in the symbolic mode. The network is executed concolically along
both positive and negative examples, to collect the values of neurons as weighted
sums in terms of both concrete values and the symbolic δ values. The value of a
neuron is computed as constraints of the following form:


$$Sym_{N,X} = \sum_i (w_i + \delta_i) \cdot N_i(X) + \sum_j w_j \cdot N_j(X) + b \qquad (5)$$

Here $Sym_{N,X}$ is a fresh symbolic variable introduced to encode the symbolic
value of neuron N for input X, the $w_i$’s denote the weights of the suspicious edges
(in the suspicious layer) while the $w_j$’s denote the other weights, which do not
need modification. Furthermore, $N_i(X)$, $N_j(X)$ represent the concrete values of
the neurons coming from the previous layer. Note that no expensive constraint
solving is needed in this step.
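
Because each δ only multiplies a concrete value from the previous layer, the symbolic value of a neuron is an affine expression in the δ's. The sketch below illustrates this with a hypothetical representation (a constant plus one coefficient per suspicious edge); it is not the authors' concolic executor, and the sample numbers are assumed values for N2 on the failing input of the running example.

```python
def symbolic_neuron(weights, prev_values, suspicious, bias=0.0):
    """Return (constant, coeffs) such that
    Sym = constant + sum(coeffs[i] * delta_i for i in suspicious).

    Expanding (w_i + delta_i) * N_i gives w_i * N_i + delta_i * N_i, so the
    constant is the ordinary weighted sum and each coefficient is N_i(X).
    """
    constant = bias + sum(w * n for w, n in zip(weights, prev_values))
    coeffs = {i: prev_values[i] for i in suspicious}
    return constant, coeffs


def evaluate(constant, coeffs, deltas):
    """Concrete value of the expression for given delta values."""
    return constant + sum(coeffs[i] * deltas.get(i, 0.0) for i in coeffs)


# N2 on the failing input: incoming weights (2.0, -1.5), previous values (2.5, 2.75);
# only the second edge (index 1) is marked suspicious.
const, coeffs = symbolic_neuron([2.0, -1.5], [2.5, 2.75], suspicious=[1])
concrete = evaluate(const, coeffs, {})  # all deltas 0, i.e. concrete mode
```

Setting every δ to 0 recovers the concrete (pre-activation) value, which matches the concrete mode described above.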


Repair Constraints and Constraint Solving. For intermediate-layer repair, we 
add the activation patterns constraints (that imply the decision constraints, see 
Eq. 1) to the set of constraints. Specifically, for each neuron N in $N_{faulty}$, and
for each (passing or failing) input X we add $Sym_{N,X} > 0$ if $N \in on(\sigma_C)$ and we
add $Sym_{N,X} \le 0$ if $N \in off(\sigma_C)$.

The solutions for the symbolic δ’s obtained from the solver guarantee that
all the inputs (both passing and failing) satisfy the pattern and are thus likely 
to be classified as C by the network. These solutions are then used to update 
the weights of the network, thus obtaining an expert for the class C. 


Example. Let us consider the example from Sect. 3, the case of the
intermediate-layer repair. As already discussed in Sect. 3, let us suppose we consider the
activation pattern for class 0 at layer 2. We select N2 as the target for repair
(since its activation along the failing test X4 is on instead of off) and we want
the input to satisfy the pattern {off, on} for {N2, N3}. We compute defect scores
for the incoming edges to N2 using the failing input and all passing inputs for
classes 0 and 1. The score of the edge between N0 and N2 is 2.0 while the score
of the edge between N1 and N2 is 2.81; we therefore select the second edge as a
target for repair. We then build the following constraints from the failing test.

$$Sym_{N_2,4} = 2.0 \cdot (-1.0 \cdot 1.5 + 2.0 \cdot 2.0) + (-1.5 + \delta) \cdot (0.5 \cdot 1.5 + 1.0 \cdot 2.0) \;\wedge\; Sym_{N_2,4} \le 0.0$$

Similarly, we build constraints from the passing tests that satisfy the pattern
for label 0, X0 and X2:

$$Sym_{N_2,0} = 2.0 \cdot (-1.0 \cdot 1.0 + 2.0 \cdot 1.0) + (-1.5 + \delta) \cdot (0.5 \cdot 1.0 + 1.0 \cdot 1.0) \;\wedge\; Sym_{N_2,0} \le 0.0 \;\wedge$$

$$Sym_{N_2,2} = 2.0 \cdot (-1.0 \cdot 1.0 + 2.0 \cdot 0.0) + (-1.5 + \delta) \cdot (0.5 \cdot 1.0 + 1.0 \cdot 0.0) \;\wedge\; Sym_{N_2,2} \le 0.0$$


In practice we also add some constraints on δ to keep it small but we omit
them here for simplicity. A solution for all the constraints is δ = -0.4 which is
used to update the weight for the target repair resulting in an expert for class 0.
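
Since the example involves a single δ and each constraint is linear in it, the feasible range can be worked out by hand. The sketch below does this in plain Python instead of calling a constraint solver; the helper names are hypothetical, the constants mirror the three constraints of the example, and the "off" condition is taken as "at most 0", which is an assumption about the intended inequality.

```python
def feasible_upper_bound(constraints):
    """Largest delta satisfying every constraint const + coef * delta <= 0.

    This sketch only handles positive coefficients, which is all the
    example needs; a real repair would hand the constraints to a solver.
    """
    for _, coef in constraints:
        if coef <= 0:
            raise ValueError("this sketch only handles positive coefficients")
    return min(-const / coef for const, coef in constraints)


def holds(constraints, delta):
    return all(const + coef * delta <= 0 for const, coef in constraints)


# (constant, coefficient of delta) for the failing test X4 and passing tests X0, X2.
constraints = [
    (2.0 * (-1.0 * 1.5 + 2.0 * 2.0) - 1.5 * (0.5 * 1.5 + 1.0 * 2.0), 0.5 * 1.5 + 1.0 * 2.0),
    (2.0 * (-1.0 * 1.0 + 2.0 * 1.0) - 1.5 * (0.5 * 1.0 + 1.0 * 1.0), 0.5 * 1.0 + 1.0 * 1.0),
    (2.0 * (-1.0 * 1.0 + 2.0 * 0.0) - 1.5 * (0.5 * 1.0 + 1.0 * 0.0), 0.5 * 1.0 + 1.0 * 0.0),
]

bound = feasible_upper_bound(constraints)  # determined by the failing test
ok_with_reported_delta = holds(constraints, -0.4)
```

The failing test gives the tightest bound, and the reported value δ = -0.4 lies inside the feasible range, whereas δ = 0 (the unrepaired weight) does not.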


4.2 Last-Layer Repair 


Fault Localization. In a classifier network the last layer typically contains as 
many neurons as the number of classes. An input is classified to label C, if the 
output of the respective neuron is greater than the values of all other output 
neurons. It is therefore natural to designate this neuron as suspicious for target 
class C. Let Nc denote the neuron at the last layer corresponding to a class 
C. We use the same technique as in intermediate layer repair (Eq. 4) to localize 
edges and short-list the important weights which are the target for repair. 


Concolic Execution. Similar to the intermediate layer repair, we add symbolic
δ values to the important weights and perform concolic execution along passing
and failing tests to create the symbolic expression for the node $Sym_{N,X}$
(following Eq. 5).


Repair Constraints and Constraint Solving. We then add the decision constraints 
for the passing and failing inputs: 


$$\bigwedge_{C' \ne C} Sym_{N_C,X} > Sym_{N_{C'},X} \qquad (6)$$

The obtained solutions guarantee that all the inputs that were used in the 
repair (both positive and negative) are classified to the correct class. The solu- 
tions are used to build the expert for each class. We then combine the experts 
using the combination strategies outlined in the next section. 


Example. Consider now the example from Sect. 3, the case of the last-layer repair.
As we aim to repair for class 1 we select for repair the neuron named $y_1$ in the
figure. The score for the edge between $N_2$ and $y_1$ is -2.75 and the score for the
edge between $N_3$ and $y_1$ is 0.45 so we select the latter for repair. We then build
the following constraints based on the failed test (note that the expression for
the second variable simplifies to a concrete value):

$$Sym_{y_1,5} = 2.5 \cdot (2 \cdot (-1 \cdot 0.6 + 2 \cdot 1.0) - 1.9 \cdot (0.5 \cdot 0.6 + 1 \cdot 1.0)) + (1.5 + \delta) \cdot (-0.5 \cdot (-1 \cdot 0.6 + 2 \cdot 1.0) + 3 \cdot (0.5 \cdot 0.6 + 1 \cdot 1.0)) \;\wedge$$

$$Sym_{y_0,5} = -1.5 \cdot (2 \cdot (-1 \cdot 0.6 + 2 \cdot 1.0) - 1.9 \cdot (0.5 \cdot 0.6 + 1 \cdot 1.0)) + 2.0 \cdot (-0.5 \cdot (-1 \cdot 0.6 + 2 \cdot 1.0) + 3 \cdot (0.5 \cdot 0.6 + 1 \cdot 1.0)) \;\wedge$$

$$Sym_{y_1,5} > Sym_{y_0,5}$$

Similar constraints are added for the positive inputs (we omit them here for
brevity). Solving these constraints gives δ = 0.1 which is added to the weight for
the edge between $N_3$ and $y_1$ to obtain an expert for class 1.
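
The arithmetic behind this example can be replayed in a few lines. The following
is only a sketch: it hard-codes the weights and the failing input (0.6, 1.0) as
they appear in the constraints above, omits the constraints for the positive
inputs, and uses hypothetical helper names (`sym_y1`, `repairs_failing_test`)
that are not part of NNREPAIR.

```python
# Hidden-layer values for the failing test (0.6, 1.0), using the weights
# that appear in the constraints of the example. All four values are
# positive, so the ReLU activations leave them unchanged.
x1, x2 = 0.6, 1.0
n0 = -1 * x1 + 2 * x2        # approx. 1.4
n1 = 0.5 * x1 + 1 * x2       # approx. 1.3
n2 = 2 * n0 - 1.9 * n1       # approx. 0.33
n3 = -0.5 * n0 + 3 * n1      # approx. 3.2


def sym_y1(delta):
    """Value of y1 with the symbolic update delta on the N3 -> y1 edge."""
    return 2.5 * n2 + (1.5 + delta) * n3


# y0 has no symbolic weight, so its expression is a concrete value.
sym_y0 = -1.5 * n2 + 2.0 * n3

# The decision constraint sym_y1(delta) > sym_y0 is linear in delta,
# so it reduces to a lower bound on delta.
delta_lower_bound = (sym_y0 - 2.5 * n2) / n3 - 1.5


def repairs_failing_test(delta):
    """True if the failing test is classified to class 1 after the update."""
    return sym_y1(delta) > sym_y0
```

Under these assumptions the failing test alone requires δ to exceed roughly
0.0875, which the reported solution δ = 0.1 satisfies while staying small.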
4.3 Combining Experts


We create experts for each label in the dataset. For example, for a neural network 
trained on the MNIST data set (which is used for the classification of handwritten 
digits from 0 to 9), we generate 10 experts — one expert per label. We propose 
three variants of how to combine these experts: 


(A) execute the model for all experts and combine the results afterwards, 
(B) merge all experts into one combined expert before model execution, and 
(C) filter strong experts first, then follow variant (A) or (B). 


Variant (A) is an instance of ensemble modeling [1], which typically involves
creating multiple models to predict an outcome. In our case, we start by
executing all the experts for each input. This is done in a combined fashion, to
avoid repeated execution of the same code: before the repaired layer the model is
executed with the original weights; starting from the repaired layer the execution
is split up for the different experts. At the end of the execution, each expert
classifies the input to a certain label. We need to combine the results from all
the experts in order to classify the input to a single label.

Each expert can classify the input to any of the labels; however, each expert
can be trusted to produce the correct result only for its own respective label.
Therefore, we start by generating a set E including the experts that classify
the inputs to their respective labels. Note that it could be that multiple experts
report that the given input belongs to their respective class or it could be that
no expert classifies the given input to the expert’s class. If E is empty, then we
select the label predicted by the original model. If there is one expert in E, then
we select this unique expert. If there are multiple experts in E, then we need to
resolve the conflict between experts and choose one label, for which we propose
three strategies:


Naive: This strategy simply falls back to the original model. 

Confidence: This strategy selects the expert from E with the highest confidence 
for its own label, i.e., the absolute value of the output node corresponding to 
the label. 

Voting: For the label corresponding to each expert in E, this strategy collects
votes from the other experts for the respective label. It then selects the expert
from E with the majority of the votes.
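
The decision logic of variant (A) can be sketched as follows. This is an
illustrative sketch rather than the NNREPAIR implementation: the function names
are hypothetical, each expert is represented only by its output vector, and ties
(which the text does not discuss) are broken in favor of the smallest label.

```python
def predicted_label(outputs):
    """Index of the largest output value, i.e., the classification decision."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


def combine(original_outputs, expert_outputs, strategy="confidence"):
    """Combine per-label experts into a single label.

    expert_outputs[c] is the output vector produced by the expert for label c.
    """
    fallback = predicted_label(original_outputs)
    votes = [predicted_label(out) for out in expert_outputs]
    # E: the experts that classify the input to their own label.
    claiming = [c for c, v in enumerate(votes) if v == c]

    if not claiming:
        return fallback          # no expert claims the input
    if len(claiming) == 1:
        return claiming[0]       # a unique expert claims the input
    if strategy == "naive":
        return fallback
    if strategy == "confidence":
        # Highest absolute output value for the expert's own label.
        return max(claiming, key=lambda c: abs(expert_outputs[c][c]))
    if strategy == "voting":
        # Votes for label c come from the *other* experts that predict c.
        def support(c):
            return sum(1 for other, v in enumerate(votes) if other != c and v == c)
        return max(claiming, key=support)
    raise ValueError(f"unknown strategy: {strategy}")
```

For instance, if the experts for labels 0 and 1 both claim an input and the
expert for label 2 predicts label 0, Voting picks label 0, whereas Confidence
picks whichever of the two claiming experts has the larger own-label output.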


In variant (B), we propose to merge the experts before executing the model.
For the intermediate-layer repair, for every weight that is considered faulty we
update it with a single δ value, which is the average of the solutions from all
the experts. This creates a single merged network. For the last-layer repair, we
simply apply all the repairs at once; there is no need for an average as the nodes
(and edges) that are targets for repair are disjoint.
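
A minimal sketch of the merging step, under stated assumptions: weights are kept
in a dictionary keyed by a hypothetical (source, target) edge name, each expert
contributes a dictionary of δ values, and the average for a weight is taken over
the experts that provide a δ for it (the text does not spell out this detail).

```python
from collections import defaultdict


def merge_intermediate(original_weights, expert_deltas):
    """Average each repaired weight's delta over the experts and apply it once.

    original_weights: dict mapping edge -> weight
    expert_deltas: list of dicts mapping edge -> delta, one dict per expert
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for deltas in expert_deltas:
        for edge, delta in deltas.items():
            sums[edge] += delta
            counts[edge] += 1
    merged = dict(original_weights)  # leave the original weights untouched
    for edge in sums:
        merged[edge] += sums[edge] / counts[edge]
    return merged


def merge_last_layer(original_weights, expert_deltas):
    """Apply all last-layer repairs at once; repair targets must be disjoint."""
    merged = dict(original_weights)
    seen = set()
    for deltas in expert_deltas:
        for edge, delta in deltas.items():
            if edge in seen:
                raise ValueError(f"edge {edge} repaired by more than one expert")
            seen.add(edge)
            merged[edge] += delta
    return merged
```

Because the merged network has the same shape as the original one, it can be
executed exactly like the original model, only with different weight values.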

In variant (C), instead of using all experts we select a subset of strong experts.
Note that each expert is constructed from failing inputs only for the respective
label. Therefore, when exposed to data which are supposed to be classified to
the expert’s label, the expert displays higher accuracy than the original model
(higher recall). However, when exposed to data which can belong to different
labels, the experts could display lower overall accuracy than the original model
(lower precision) due to a high number of false positives. Therefore, we determine
which of the experts have an F1 score (which combines precision and recall),
computed over all positive/negative inputs, higher than that of the original model
and retain only those while filtering out the rest. The same combination
strategies as in variant (A) are used to obtain a single classification result for
the input.
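
The filtering step of variant (C) can be sketched as below. The sketch assumes
that, for each label, true positives, false positives and false negatives have
already been counted for both the expert and the original model on the same
inputs; the function names and the count-based interface are illustrative only.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.

    Returns 0.0 when all three counts are zero.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_strong_experts(expert_counts, original_counts):
    """Keep the labels whose expert has a higher F1 score than the original model.

    Both arguments map label -> (tp, fp, fn), counted for that label.
    """
    return [
        label
        for label, counts in sorted(expert_counts.items())
        if f1_score(*counts) > f1_score(*original_counts[label])
    ]
```

An expert that gains recall at the cost of many false positives is dropped by
this test, since its F1 score falls below that of the original model.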


5 Evaluation 


We implemented our approach in the NNREPAIR tool pipeline, which is based
on NEUROSPF [21]. It first translates a trained Keras model into Java, and then
uses Symbolic PathFinder (SPF) [16] for concolic execution and z3 [14] for
constraint solving. In this section we evaluate NNREPAIR by considering its
application to three common scenarios: improving accuracy (Scenario 1), fixing
backdoor attacks (Scenario 2), and enhancing adversarial robustness (Scenario 3).
Our experiments use two commonly used datasets for image classification networks,
MNIST and CIFAR-10. We consider two architectures for MNIST with 10 and
7 layers respectively. They are convolutional neural networks (CNNs) and have
the typical structure of modern neural networks such as convolutional/dense,
max-pooling and softmax layers. The first MNIST model has an accuracy of
96.34% on the standard test set, while the second model has an accuracy of
98.89%. We refer to these models as MNIST-LQ (low-quality) and MNIST-HQ
(high-quality) respectively. The CIFAR-10 model is a 15-layer CNN with 890k
trainable parameters and has an accuracy of 81.04%. In order to validate our
approach, we consider the following research questions:


RQ1 Is NNREPAIR successful in correcting the defects in all three scenarios? 

RQ2 How do intermediate-layer repair and last-layer repair compare with each 
other? 

RQ3 What is the inference time overhead introduced by NNREPAIR over the 
original model? 


5.1 Scenarios 


(1) The goal of repair in the first scenario is to improve the overall accuracy of 
a model. We measure the improvement in accuracy on the standard test set, 
henceforth denoted Test. We use positive and negative examples from the 
train set, henceforth denoted Train, to generate the repair. 

(2) For this scenario, we apply the backdoor attack from [6]. Samples of poisoned
data are shown in Fig. 3. The poisoned models have good accuracy on the
standard data, but poor accuracy on the poisoned data. The goal of the
repair is to improve the accuracy on poisoned data, which we measure on a
separate poisoned test set P-Test. At the same time, we expect the repair
to retain the accuracy on standard, un-poisoned data, which we measure on
Test. In this scenario, the first 600 inputs in Train are poisoned (P-Train).
We draw from these particular inputs to get the negative examples to focus
the repair on the defect. We draw the positive examples from Train.

Fig. 3. Example poisoned data for MNIST (left) and CIFAR-10 (right). The backdoor
is embedded as the white square at the bottom right corner of each image. When the
backdoor appears, the poisoned MNIST model will classify the input as “7” and the
poisoned CIFAR-10 model will classify it as “horse”.

(3) For the last scenario, we apply adversarial perturbations over Train and Test
using FGSM¹, for ε = 0.05. This results in four data sets: Train, Adv-Train,
Test and Adv-Test. The models have good overall accuracy on Train and
Test, but poor accuracy on Adv-Train and Adv-Test. The goal of the repair
here is to improve the accuracy on the adversarial data (which we measure
on Adv-Test) without damaging too much the accuracy on standard data
(which we measure on Test). We draw the negative examples to be used in
repair from Adv-Train, while we use positive examples from both Adv-Train
and Train. Since we use two separate sets to generate experts, when computing
the F1 score for the selection of experts, we explored two different options:
computing the F1 score over Adv-Train only and computing the harmonic mean of
the F1 scores computed over Train and Adv-Train separately. However, in
practice there was no difference as the same experts were filtered in both cases.
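
To illustrate what an FGSM perturbation does, here is a deliberately tiny
stand-in: a single FGSM step against a logistic model instead of the CNNs used in
the experiments. The model, the function name `fgsm_logistic` and the clipping of
the result to [0, 1] are assumptions made for this sketch; only the step size
ε = 0.05 is taken from the text.

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_logistic(x, y, w, b, eps=0.05):
    """One FGSM step against p = sigmoid(w.x + b) with cross-entropy loss.

    The gradient of the loss with respect to the input is (p - y) * w, and
    FGSM moves every input component by eps in the direction of its sign.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    grad = [(p - y) * wi for wi in w]
    # Clip to [0, 1], the usual range for normalized pixel intensities.
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]
```

Each component moves by at most ε, so the adversarial input stays visually close
to the original while the loss of the model on it increases.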


5.2 Experiment Set-Up 


For each of the three scenarios, we experimented with both intermediate-layer
and last-layer repairs. We evaluated all the combination strategies (Naive,
Confidence, Voting, and Merged) with the F1-filtering option being OFF and ON.
When F1-filtering is OFF, the experts for all labels are used in the combination
strategies by default, while when it is ON, we only include those experts whose
F1 score on Train is greater than that of the original model.


Intermediate-Layer Repair: We focused on the dense layer just before the output
layer for both the MNIST and CIFAR models. The intuition for this selection is
that dense layers appearing closer to the output potentially impact the
classification decision more than convolutional layers closer to the input (which
have the role of feature extraction). The MNIST models have 128 and 100 ReLU
nodes and 576 and 400 incoming edges to each neuron at this layer respectively,
while the CIFAR model has 512 ReLU neurons and 1,600 incoming edges at this
layer. We extracted high-support patterns for correct classification; the average
support per label was within 1,013-2,502 across scenarios, out of around 6,000
inputs per label. The neurons short-listed in $N_{faulty}$ using pattern-based
localization varied between 1 and 10 in number. We focused on modifying the
weights of the incoming edges having their scores within the top 10%.

¹ https://www.tensorflow.org/tutorials/generative/adversarial_fgsm

We used the patterns extracted at the layer to select a subset of tests for
the purpose of constraint solving. As explained in Sect. 4, we used decision-tree
learning to extract patterns for correct classification for every label. We
also extracted patterns for incorrect classification for each label, which represent
neuron activations satisfied by inputs which should ideally be classified to the
given label but get mis-classified. From the set of all failing tests for a given
label, we select all inputs that satisfy the pattern for incorrect classification for
the label. From the set of all passing tests for a given label, we select the subset
of inputs that satisfy the pattern for correct classification. We then randomly
select (number of failing tests) + 100 inputs from this set. The subset of failing
and passing tests selected using the procedure above is used for constraint solving.
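
The selection procedure above can be sketched as follows. The two pattern
predicates are hypothetical stand-ins for the learned decision-tree patterns, and
the sketch makes two assumptions the text leaves open: the count of failing tests
refers to the failing tests that were selected, and the sample is capped when
fewer passing candidates are available.

```python
import random


def select_tests(failing, passing, fails_pattern, passes_pattern, extra=100, seed=0):
    """Pick the failing and passing tests used for constraint solving for one label.

    fails_pattern / passes_pattern are predicates over a test input that say
    whether it satisfies the pattern for incorrect / correct classification.
    """
    chosen_failing = [t for t in failing if fails_pattern(t)]
    candidates = [t for t in passing if passes_pattern(t)]
    # Randomly keep (number of failing tests) + extra passing tests.
    k = min(len(candidates), len(chosen_failing) + extra)
    rng = random.Random(seed)
    chosen_passing = rng.sample(candidates, k)
    return chosen_failing, chosen_passing
```

Keeping the number of passing tests tied to the number of failing tests bounds
the size of the constraint system handed to the solver.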


Last Layer Repair: At the last layer, the two MNIST models have 10 ReLU
nodes and 128 and 100 incoming edges to each neuron respectively, while the
CIFAR model has 10 ReLU neurons and 512 incoming edges. For each label, we
selected 5 failing and 5 passing inputs randomly from the respective datasets. For
the first scenario, both these failing and passing inputs come from the Train set.
For the last-layer repair, the top 5 suspicious weights were made symbolic for each
expert. We determined empirically that a larger number of symbolic weights
and/or passing/failing inputs often leads to unsat constraints while a smaller
number may not improve the network.

The poisoned (2) and adversarial (3) scenarios differ from scenario 1 in that
they seek to address two challenges. The repaired model needs to have better
accuracy than the original model on poisoned and adversarial inputs respectively
(evaluated on the P-Test and Adv-Test sets), and its accuracy on normal
inputs should not be degraded much (evaluated on the normal Test set). For this
reason, for the purpose of constraint solving, in addition to including passing
tests from the respective poisoned and adversarial train sets, we also include
passing tests from Train. We performed experiments increasing the number of passing
tests included from the normal train set from 0 to 10, 50, and 100.


5.3 Results 


Table 2 presents a summary of our results (please refer to the Appendix² for more
detailed results). The table displays the results for MNIST and CIFAR models
for the three scenarios. For each scenario, the results for both intermediate layer
and last layer repair are presented in terms of the improvement in accuracy
obtained over the original model. We report the best result, i.e., the largest
improvement in accuracy on the respective test set (normal Test for the first
scenario, P-Test for the second and Adv-Test for the third). The combination
strategy and the F1-Filter setting (ON/OFF) used to obtain the best result are
also displayed, along with the corresponding improvements in accuracy on the
other train and test sets. For the repair, z3 was able to generate solutions for each
expert within a minute. The constraint generation using SPF was the bottleneck:
SPF generated constraints for each expert within 15-60 min, depending on
the number of tests included. However, this could be improved since running
SPF on all positive/negative inputs can be performed in parallel. Experiments
were performed on a Windows 10.0 machine with Intel Core-i5 and 16 GB RAM.
The code and the constraint files, along with the Z3 solution files, are available at
https://github.com/muhammadusman93/nnrepair.

² https://arxiv.org/abs/2103.12535

RQ1: For this research question we seek to investigate if NNREPAIR is
successful in correcting defects in all three scenarios. To measure success, we
consider the improvement in accuracy provided by the repair in all three scenarios.

The effectiveness of NNREPAIR in improving accuracy (Scenario 1) can be 
analyzed by considering Table 2 (cases MNIST-LQ, MNIST-HQ and CIFAR10). 

We observe that the best improvement provided by NNREPAIR was +0.20 for the
MNIST-LQ model, +0.02 for the MNIST-HQ model and +0.16 for the CIFAR10
model. This improvement (albeit small) was achieved without any new inputs or
re-training. The size of the improvement appears to shrink as the quality of
the original model increases. We note that achieving improvement in the overall
accuracy of an already high-quality model without new data is very challenging.
In fact this improvement appears to be in line with or better than related repair
techniques (see Sect. 6). Note also that the complexity and size of the models do
not seem to have an impact on the effectiveness of the repair. The MNIST-HQ
architecture is simpler than MNIST-LQ and the CIFAR10 architecture is much
bigger and more complex than the MNIST models.

For Scenario 2, on the MNIST-Pois model, NNREPAIR increased the accuracy
from 10.38% to 55.94% on poisoned inputs (P-Test). The repair causes
a slight decrease (-3.11) in accuracy on non-poisoned inputs (Test) but the
repaired model still has a high accuracy (>95.5%) on non-poisoned inputs. On
the more challenging CIFAR10-Pois model, the best improvement provided by
NNREPAIR is a +3.77 increase on poisoned inputs, and a small decrease in
accuracy on non-poisoned inputs (-0.61). For Scenario 3, on the MNIST-Adv model,
NNREPAIR increased the accuracy from 28.37% to 38.77% on adversarial inputs,
while causing a small decrease (-3.14) in accuracy on non-adversarial inputs.
For CIFAR10-Adv, the best result was an increase of +0.34 on adversarial inputs
with a minor decrease of -0.07 on non-adversarial inputs.
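
The absolute accuracies quoted above follow from the original accuracies and the
changes listed in Table 2. A small sketch of that bookkeeping, with a hypothetical
helper name and all values read as percentages and percentage points:

```python
def repaired_accuracy(original_pct, change_pct_points):
    """Accuracy after repair: original accuracy plus the change from Table 2."""
    return round(original_pct + change_pct_points, 2)


# MNIST-Pois, last-layer repair: accuracy on P-Test and on Test.
mnist_pois_p_test = repaired_accuracy(10.38, 45.56)
mnist_pois_test = repaired_accuracy(98.63, -3.11)
# MNIST-Adv, last-layer repair: accuracy on Adv-Test.
mnist_adv_adv_test = repaired_accuracy(28.37, 10.40)
```

The second value, 95.52%, is the basis for the statement that the repaired
MNIST-Pois model retains more than 95.5% accuracy on non-poisoned inputs.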

For the last two scenarios, the primary goal is to improve accuracy on
poisoned or adversarial data. Although ideally we would also want to preserve the
original accuracy on normal data, this may not always be possible in practice.
We experimented with varying numbers of passing tests from Train for scenarios 2
and 3. The results are presented in the first table in the Appendix. The accuracy
of the resulting repair on the poisoned/adversarial test sets tends to decrease as
the number of normal passing tests goes up. However, this also reduces the
degradation in the accuracy on the normal test set. Previous studies in adversarial
robustness [23] indicate that one can obtain robust networks but the price to
be paid is a significant decrease in accuracy on normal data. Similar
considerations apply to the poisoning case. Therefore, we tolerate a small decrease
in the accuracy on normal Test in our work as well.

Table 2. Summary of NNREPAIR performance on all models. The Repair column shows
the type of repair, i.e., intermediate or last layer. The increase/decrease in
accuracy is shown in terms of the difference between the accuracy of the repaired
model and the original model on the respective datasets. The accuracy of the
original model is shown in brackets below each data set. The Strategy column shows
the combination strategy which works best for each scenario. ALL means that all
strategies performed equally. F1-Filter shows if the best results were obtained by
turning F1-Filter ON or OFF. The number of experts used is shown in brackets.

Model         Repair  Increase/decrease in accuracy               Strategy    F1-Filter

                      Train      Test
                      (96.59%)   (96.34%)
MNIST-LQ      Interm  +0.22      +0.20                            Votes       ON(3)
MNIST-LQ      Last    +0.00      +0.00                            ALL         ON(0)

                      Train      Test
                      (99.81%)   (98.89%)
MNIST-HQ      Interm  +0.01      +0.02                            Merged      ON(3)
MNIST-HQ      Last    +0.00      +0.00                            ALL         ON(0)

                      P-Train    Test       P-Test
                      (98.99%)   (98.63%)   (10.38%)
MNIST-Pois    Interm  +0.00      -0.01      +1.81                 Votes       ON(2)
MNIST-Pois    Last    -2.60      -3.11      +45.56                Confidence  OFF

                      Train      Adv-Train  Test       Adv-Test
                      (98.67%)   (29.92%)   (97.87%)   (28.37%)
MNIST-Adv     Interm  -4.35      +2.75      -4.15      +3.87      Confidence  ON(9)
MNIST-Adv     Last    -3.99      +11.15     -3.14      +10.40     Merged      ON(10)

                      Train      Test
                      (87.25%)   (81.04%)
CIFAR10       Interm  +0.03      +0.03                            Merged      ON(1)
CIFAR10       Last    +0.12      +0.16                            ALL         ON(1)

                      P-Train    Test       P-Test
                      (96.97%)   (72.26%)   (15.89%)
CIFAR10-Pois  Interm  +0.03      +0.02      +0.81                 Merged      ON(4)
CIFAR10-Pois  Last    -0.89      -0.61      +3.77                 Merged      OFF

                      Train      Adv-Train  Test       Adv-Test
                      (87.25%)   (34.39%)   (81.04%)   (35.96%)
CIFAR10-Adv   Interm  +0.05      +0.22      -0.07      +0.34      Merged      ON(10)
CIFAR10-Adv   Last    -0.25      +0.37      -0.27      +0.27      Merged      ON(10)

The last two columns in Table 2 list the combination strategies and the
F1-filtering option which work best for each scenario. The Merged strategy seems
to work well for the CIFAR10 model for all three scenarios. However, there
is no clear winner for the MNIST models. In fact, for the last layer repair on
CIFAR10, all the strategies gave the same improvement in accuracy. In practice,
the users would need to use a separate validation set and try all the strategies
to pick the best one for their application domain.


Answer RQ1: NNREPAIR shows benefit in all three scenarios. It
can repair a network to make it robust against adversarial
perturbations/poisoned inputs while retaining good accuracy
on the normal, unperturbed/non-poisoned test set. NNREPAIR can also
improve the overall accuracy of the models; however, the effectiveness of
the repair tends to decrease when the original accuracy is already high.


RQ2: Table 2 can be used to compare the performance of intermediate-layer 
and last-layer repair on the different scenarios. For the MNIST models, last-layer 
repair did not help in improving the overall accuracy. Repairing the dense layer 
before the output layer using the pattern-based repair helps in increasing the 
accuracy albeit by a small amount. For the CIFAR10 model, on the other hand, 
repairing the output layer increases the overall accuracy of the model by 0.16, 
which is better than intermediate-layer repair (+0.03). 

For the poisoned and adversarial scenarios, on the MNIST models, last-layer
repair performed better than intermediate-layer repair on the targeted test sets.
Intermediate-layer repair increased the accuracy by 1.81 on the poisoned model
and 3.87 on the adversarial model while last-layer repair increased the
accuracy by 45.56 on the poisoned model and 10.40 on the adversarial model. For
CIFAR10-Pois, intermediate-layer repair increases the accuracy by 0.81 while
last-layer repair improves it by 3.77. Note that intermediate-layer repair seems
to be better at retaining the accuracy on the standard Test, albeit providing
smaller improvements on the target sets (detailed results in the Appendix).
Furthermore, for CIFAR10-Adv, intermediate-layer repair gives better results than
last-layer repair (0.34 vs 0.27 respectively).

To summarize, focusing only on an inner layer of the network or just the
output layer may not suffice to correct errors in all models and scenarios. We
plan to investigate application of repair at more than one layer. Fault localization
approaches may help determine the layer(s) to focus on for effective repair for a
given application.


Answer RQ2: Intermediate-layer repair helped more in improving the
overall accuracy of the models (except for CIFAR10) and last-layer repair
was more effective in repairing specific failures such as vulnerabilities to
poisoned or adversarial inputs (except on the CIFAR10 adversarial model).
The takeaway is that there is not a specific type of repair (last-layer or
intermediate-layer) that works well consistently and different models and
failure scenarios may necessitate repair at different layers.
RQ3: To understand the overhead introduced by running multiple experts
and the combination logic, we conducted experiments on one of the models, 
MNIST-LQ. We executed the original model on the test set and compared the 
inference time with the model produced by a repair at an intermediate layer 
(i.e., layer 6) and by a repair at the final layer (i.e., layer 8). Additionally, we 
measured the inference time for an intermediate layer repair with F1-Filtering 
(i.e., layer6-F1). We performed this comparison for all 10,000 inputs in the test 
set. 

The Merged combination strategy does not require any expert combination
after model execution because this strategy merges the repairs in advance.
Therefore, there is no change with regard to the original model execution except
the weight values used in the calculations, and we did not observe any difference
in terms of the inference time. We focus the remaining discussion on the strategies
that require the execution of multiple experts. Our experiments show that the
time for the expert combination after model execution (as necessary for the Naive,
Confidence, and Voting combination strategies) is negligible, at around 0.0008
ms, and is similar for all these combination strategies. The main overhead
is introduced by the additional calculations necessary to compute the
multiple expert values at each layer. The box plot in Fig. 4 shows the total time
for the model execution for the experts, including the time for the Naive expert
combination.
Fig. 4. Inference time comparison (Naive Combination Strategy)


The repair at the last layer produces an average slowdown (compared to
the original model) of 1.0383x. In contrast, the repair at the intermediate layer
produces an average slowdown of 7.7638x. Therefore, it makes sense to filter
out experts that do not show good performance on the training
set (see F1-score filtering in Sect. 4.3). For this experiment we kept 3 experts (see
the plot with layer6-F1). This reduced model produces an average slowdown of
only 3.0742x.
Answer RQ3: The Merged combination strategy does not impact the
inference time. All other combination strategies introduce a similar
overhead. While the inference time for the last-layer repair is comparable with
the original model, the inference time for an intermediate-layer repair is
expensive. However, it can be significantly reduced with F1 filtering.


5.4 Discussion 


The purpose of our evaluation was to showcase the versatility of NNREPAIR in 
different scenarios. The takeaway from the experiments is that there is not a spe- 
cific type of repair (last-layer or intermediate layer) that works well consistently 
and different models and failure scenarios may necessitate repair at different lay- 
ers. In particular, we believe that the intermediate-layer repair holds the most 
promise for scaling to large networks and we plan to further experiment with 
the technique in the future. 

Generally, the best repair results are obtained on the poisoning task, where
the accuracy can be increased by up to 45.56 and 3.77 percentage points on MNIST
and CIFAR10 respectively, without a need for retraining, which can be expensive
in practice. Furthermore, note that we do not assume knowledge of the poison, as
our techniques only use information about correct and incorrect classification.
In the future, we plan to perform more experiments with different poisoning
scenarios.

We were able to obtain modest accuracy improvements on the high-quality 
models, while for the low-quality models, re-training can achieve better results 
(see comparison with MODE in the next section). More experimental comparison 
with retraining and/or fine-tuning the models is needed to further assess the 
merits of our constraint-based repair. 

The gains in the adversarial setting are not very significant for the larger
models. In this work, our goal was to demonstrate the feasibility of using
localized constraint solving as a generic technique for addressing a wide range of
challenges in deep learning. Adversarial attack is only one potential application
scenario that is considered. There is a large body of research work on adversarial
attacks and we cannot claim in any way that we can cover all attacks.

We also note that the efficacy of NNREPAIR is evaluated statistically (over 
the test set) as our method does not provide any formal guarantees. In general, it 
is difficult to guarantee an improvement of the overall accuracy with formalisms, 
as there are no formal specifications for the image classification domain. Thus, 
in practice one builds (trains) a model using a statistical measure of accuracy. 
6 Related Work


The emphasis of this paper is on neural network repair, where the goal is to
“correct” the neural network and improve its performance, robustness and
security, by using a small number of labeled inputs. There have been relatively few
attempts at repairing a neural network. These neural network repair works can
be classified according to whether re-training is needed and/or whether there is a
first step that prioritizes the neuron weights to fix. A number of fix patterns and
challenges for neural network repair were collected in [9].

In MODE [12], a neural network is said to be buggy for a specific output 
label if its test accuracy is lower than the expectation. This is fixed by selecting 
features that are critical for the misbehavior via differential analysis using a 
subset of training data and then retraining by selecting inputs from the remaining 
unused training inputs based on the differential heat map. We ran MODE on 
the MNIST models from our study. The results are as follows: 


Model       Change in Test Acc. (%)
MNIST-LQ    +0.37
MNIST-HQ    -0.40


NNREPAIR has similar performance, i.e., slightly better than MODE on
MNIST-HQ (+0.02 vs -0.40) and slightly worse on MNIST-LQ (+0.20 vs +0.37).
Meanwhile, the re-training procedure in MODE led to varied performance for the
repaired model. The results for MODE are the average outcome after 10 runs, none
of which improved the accuracy of MNIST-HQ.

Unlike MODE, which identifies ill-trained weights or buggy neurons, Apricot
[24] first generates a set of reduced models by training the original neural
network on reduced sets of training data. At each iteration of the training, Apricot
then adjusts each weight of the repaired model towards the average weight of the
reduced models that classify the input correctly and away from that of the
reduced models that misclassify it.
The approach from [19] uses constraint solving for repairing neural networks. It
considers a two-dimensional slice of the input space of ACAS Xu and uses SMT
constraints to achieve weight changes for correct cases that are checked against
the specification. We found it non-trivial to extend this approach to the typically
high-dimensional input space of the image classifiers that we study in this paper.

Typically, a software repair technique (including for neural networks) employs
as a first step fault localization to determine the code entities that need to be
fixed. DeepFault [2] is an approach to spectrum-based fault localization that aims
to identify the neurons that are ‘more’ responsible for adversarial behaviours of a
neural network. However, the aim of DeepFault is to generate more adversarial
examples, which is the opposite of the repair purpose of our paper. Another
related approach, Arachne [18], uses fault localization to identify neural weights
(connected to the final output layer) to modify, and searches with Particle Swarm
Optimisation (PSO) for better weights to improve the model’s accuracy on some
particular label. As also noted in [18], increasing the prediction accuracy for a
particular label often comes with a decrease in the prediction accuracy of the
overall neural network model.

Our NNREPAIR work provides a general repair approach which can be applied 
for improving accuracy, enhancing robustness against adversarial attacks and 
fixing the backdoor security problems for neural networks. Although previous 
techniques could be presumably extended to these scenarios, in practice they 
were only demonstrated for improving the prediction accuracy of the neural 
network (in MODE and Apricot) or a particular label (in Arachne). 


7 Conclusion and Future Work 


We presented NNREPAIR, which uses constraint solving for intermediate-layer 
and last-layer repair of neural networks. We demonstrated NNREPAIR in three 
scenarios: improving the overall accuracy, fixing security vulnerabilities caused 
by data poisoning and improving the adversarial robustness of the networks. 

In future work, we plan to experiment with different localization techniques
and to evaluate our repair on larger networks and different architectures. Our
method can also be applied to multiple layers but we restricted ourselves to
single-layer repair for scalability. One avenue for research is to apply single-layer
repair repeatedly or compositionally to correct bugs across multiple layers.
Abstract. Formal methods are becoming an indispensable part of the
design process in software and hardware industry. It takes robust tools 
and proofs to make formal validation of large scale projects reliable. 
In this paper, we will describe the current status of formal verifica- 
tion at Centaur Technology. We will explain our challenges and our 
methodology—how various proofs and verification artifacts are intercon- 
nected and how we keep them consistent over the duration of a project. 
We also describe our main engine—a powerful symbolic simulator with 
rewriting capabilities that is integrated in a theorem prover and proven 
correct. 
1 Introduction


The discussion of Formal Verification (FV) of software and hardware three 
decades ago was mostly about case studies or proofs of concept that required 
a lot of manual effort by researchers. Since then, FV has taken a transforma- 
tional journey that has resulted in highly automated tools—equivalence checkers, 
model checkers, SMT solvers, and theorem provers. Large scale formal verifica- 
tion projects were first reported by hardware companies around ten years ago, 
e.g. Intel [28], IBM [36], ARM [34], and Centaur Technology [18,37]. Success
stories of FV at software development companies followed. To name just a few, see
Peter O’Hearn’s keynote at the PLDI 2020 conference about incorrectness logic and
the static analysis his group applies at Facebook [30], David Dill’s keynote at CAV
2020 about the Libra project at Facebook [19] and their use of the Move Prover 
[44], or the invited talk by Byron Cook at CAV 2018 about the application of 
formal methods at Amazon Web Services [16]. Formal methods are becoming 
a reliable and indispensable part of the design process in the commercial soft- 
ware and hardware industries. This newly elevated position of formal verification 
brings new responsibilities for those that develop tools and methods and those 
who build proofs. FV teams face various challenges: 

• Tools and libraries used by FV teams are expected to be reliable and
maintainable.

• FV teams get involved much sooner in a project cycle, often starting with an
incomplete design, and they are expected to give feedback quickly.

• Designs under FV scrutiny are being continuously changed by several
designers at a time.

• Specifications change during the process as designers get feedback from
back-end tools, or due to changes in the target market.

• The scope and depth of proofs change as development continues.

• An FV team might be working on several proliferations of a project with
overlapping schedules.


These challenges can only be solved by building robust expandable proofs. In 
this paper, we will describe the approach taken by our FV team at Centaur Tech- 
nology. Centaur is a relatively small company, of about one hundred employees, 
that designs x86 compatible microprocessors, focusing on the low cost, low power 
market. It might surprise many that our formal verification tools are based on a 
theorem prover. This is only possible because the theorem prover we use, ACL2 
[8], has been designed with industrial applications in mind [24]. ACL2 has been 
successfully used not only at our company but also at many others: e.g. ARM, 
AMD [35], IBM [36], Rockwell-Collins [22], and Oracle [32]. All our proofs are 
done within the ACL2 system. ACL2 is used to write specifications, models, 
tools, and tests, as well as to generate documentation. Two features of ACL2 
that are crucial to our work are fast execution and extensibility. Our x86 model
[20] is not only one of the most complete of its kind, but is also capable of executing
application programs at a speed of around 3 million instructions per second.

We will start with a brief description of the ACL2 system and the features 
that make it a good choice for a verification framework (Sect. 2). The reflective 
features of ACL2 allow us to build verified tools within the system. One such tool 
is FGL [39], our symbolic simulator equipped with rewriting capabilities. FGL 
is completely integrated into ACL2 as a verified clause processor. It provides 
a desirable balance between automation and user guidance. We will describe 
its mechanism in detail in Sect. 4. We also explain its usability as a highly
programmable solver that is capable of proving complex conjectures about
Register-Transfer Level (RTL) design and microcode in Sect. 3. FGL and its use within
our framework are primary contributions of the work presented in this paper. 
The challenges enumerated above are illustrated with the process of verifica- 
tion for a single x86 instruction. We explain the complex interconnection of the 
various parts of the proofs, and describe how they are built and maintained. 


2 Our FV Tools 


All formal verification at Centaur is done within the framework of ACL2 [8]. 
ACL2 is an untyped language (a subset of Common Lisp) and a theorem prover 
that supports first-order logic as expressed in this language. ACL2 also has 
some limited support for higher-order style definitions [29]. ACL2 is an open
source software project that has an active community contributing to an extensive
library of proofs and utilities. Centaur has contributed to many libraries that
support hardware verification, including support for translating Verilog and
SystemVerilog to ACL2 expressions [7, 10] and libraries that support bit operations.
ACL2 provides an interface through which it can be connected to trusted tools 
such as SAT solvers. There is also an integration of Z3 in ACL2 [5] and an 
interface to the ABC model checker [1, 14]. 

Besides interfaces to trusted tools, ACL2 has a mechanism for extending its 
reasoning by admitting verified clause-processors [2]. We use this feature in sev- 
eral ways, notably for SVL [43], a routine that automates verification of multipli- 
ers, and for FGL, the core tool that provides automation for our microoperation 
execution and microcode proofs. 

FGL, briefly, is a term rewriter geared toward transforming expressions act- 
ing on fixed-sized data into Boolean formulas. For example, a specification for an 
x86 instruction may be written in high-level ACL2. Processing a call of this spec- 
ification function on variable arguments in FGL yields a result that expresses 
each of the bits of the writeback data, flags, etc., as a Boolean formula (rep- 
resented in an and-inverter graph) whose inputs are the symbolic bits of the 
input variables. Similarly, FGL processing of the ACL2 model of the microcoded 
implementation for that instruction yields Boolean formula representations of 
the implementation’s outputs. Equivalence checking these two sets of Boolean 
formulas is then sufficient to show that the implementation result matches the 
specification. We describe the FGL system in more detail in Sect. 4, showing
how it transforms terms into hybrid term/Boolean-function objects and how its 
behavior may be programmed with rewrite rules. 
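
The flow described above (reduce both the specification and the implementation to per-bit Boolean functions of the same input bits, then check equivalence) can be illustrated with a small Python sketch. It is not ACL2 or FGL code: the 4-bit adder, the function names, and the exhaustive enumeration that stands in for an AIG plus SAT check are all assumptions made for illustration.

```python
from itertools import product

WIDTH = 4  # small enough to enumerate every input assignment


def to_bits(value, width=WIDTH):
    """Return the bits of `value`, least-significant first."""
    return [(value >> i) & 1 for i in range(width)]


def spec_add(a_bits, b_bits):
    """High-level specification: integer addition modulo 2**WIDTH."""
    a = sum(bit << i for i, bit in enumerate(a_bits))
    b = sum(bit << i for i, bit in enumerate(b_bits))
    return to_bits((a + b) % (1 << WIDTH))


def impl_add(a_bits, b_bits):
    """Gate-level implementation: a ripple-carry adder from AND/OR/XOR."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out


def equivalent(f, g, width=WIDTH):
    """Return None if f and g agree on every input, else a counterexample.

    Exhaustive enumeration is only a stand-in for the SAT-based
    equivalence check a real tool would run on the Boolean formulas.
    """
    for bits in product((0, 1), repeat=2 * width):
        a_bits, b_bits = list(bits[:width]), list(bits[width:])
        if f(a_bits, b_bits) != g(a_bits, b_bits):
            return a_bits, b_bits
    return None
```

Calling `equivalent(spec_add, impl_add)` returns `None` when every output bit of the two descriptions agrees; a faulty implementation yields a concrete counterexample, which is the kind of feedback an engineer would use for debugging.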


3 Challenges of Verifying a Single x86 Instruction


An intuitive notion of the functional correctness of a microprocessor is that any 
sequence of bytes decoded as instructions either executes correctly or leads to 
an exception if the byte sequence is illegal. For the x86 instruction set, parsing and
decoding a sequence of bytes is a complex process due to the many instruction
formats with varying lengths and field types. The Intel 64 and IA-32 Instruction
Set Architecture (ISA) is defined by the Software Developer’s Manuals [27],
which have thousands of pages describing the expected impact of every instruc- 
tion on the state of the machine. It is a living and growing specification, with new 
instructions and variants added constantly. The architectural specification does 
not dictate how the ISA is supposed to be implemented. Various implementation- 
specific choices, collectively called the microarchitecture, include: 


— how memory is organized 

— how an instruction is decoded into a sequence of microoperations 

— the set of microoperations implemented in hardware

— the throughput and latency of microoperations and instructions 

and various other features of the microprocessor. In our previous work [21],
we described what it means for an x86 instruction to be decoded and executed 
correctly and how our proofs capture this property. For illustrative purposes, we 
use the same example that was described in that work. Table 1 describes the x86 
double-precision shift right instruction SHRD and Table 2 shows the microoperations
that implement it¹. In this paper, we will recall the individual steps of the
verification with a different purpose—to discuss the challenges in each step and 
how we deal with them. In particular, we will focus on increasing the automation 
and reducing the time required of engineers to catch and debug problems while 
maintaining the proofs. 

In the process of verification, we refer to two sets of formal specifications: 
the architectural specification of x86 [20] and a microarchitectural specification, 
which is a proprietary IP of Centaur and unique to each project. We refer to the 
former as the x86 model and the latter as the microcode model. Both of these
models are written in ACL2 following an interpreter-style operational semantics 
approach. The x86 model includes the specification of x86 instructions that oper- 
ate on the ISA state, and analogously, the microcode model includes the specifi- 
cations of microoperations that operate on the microarchitectural state. Thanks 
to the high execution speed of the x86 model, it can be validated by running 
extensive code. The microcode model is directly compared to the RTL implemen- 
tation. In addition, for data-intensive operations like floating-point arithmetic, 
we have the ability to run our models against existing x86 hardware from Intel 
and AMD. Again, the efficient execution of ACL2 code is crucial for the valida- 
tion of these models. 

Our verification is done on the Register-Transfer Level (RTL) of microproces- 
sor design. We have two goals: to confirm that the RTL behaves as specified by 
our microarchitectural specification and to show that it implements instructions 
correctly with respect to our architectural specification. 


Table 1. SHRD - Double Precision Shift Right: irrelevant fields elided


Opcode              | Instruction           | Description
REX.W + 0F AC /r ib | SHRD r/m64, r64, imm8 | Shift r/m64 to right imm8 places while shifting bits from r64 in from left


3.1 Front-End and Microcode Verification 


The front-end of a microprocessor fetches, decodes, and then translates a 
sequence of bytes into a sequence of microoperations. For a modern x86 pro- 
cessor, this is one of the more complicated parts of the design. Writing and 


1 Note that this is not the actual implementation of SHRD in our current design. 


Table 2. SHRD RCX, RDX, imm8: a concrete run


Initial values:  RDX := 0x1122_3344_5566_7788; RCX := 0x0123_4567_89AB_CDEF; imm8 := 16
Expected values: RDX := 0x1122_3344_5566_7788; RCX := 0x7788_0123_4567_89AB


UOPs from front-end                    | Concrete Run                    | Description
MOVSX G2, RCX (SSZ: 64; DSZ: 64)       | G2 ← 0x0123_4567_89AB_CDEF      | Move RCX to internal register G2
MOVZX G3, <imm8> (SSZ: 8; DSZ: 64)     | G3 ← 16                         | Move immediate to internal register G3


UOPs in ROM                            | Concrete Run                    | Description
AND G3, G3, 63 (SSZ: 8; DSZ: 64)       | G3 ← 16                         | Mask immediate operand
MOV G10, -1 (SSZ: 64; DSZ: 64)         | G10 ← 0xFFFF_FFFF_FFFF_FFFF     | Move -1 to internal register G10
JE G3, 0, ent_nop (SSZ: 16; DSZ: 16)   | No jump taken                   | Jump to routine ent_nop if G3 == 0
SUB G5, 0, G3 (SSZ: 32; DSZ: 32)       | G5 ← 0xFFFF_FFF0; ZF ← 0        | Store -G3 in internal register G5; clear the zero flag because result is non-zero
SHR<!zf> G10, G10, G5 (SSZ: 64; DSZ: 64) | G10 ← 0xFFFF                  | Shift G10 right by (G5 & 63) if ZF == 0
AND<zf> G10, G10, 0 (SSZ: 64; DSZ: 64) | G10 ← 0xFFFF                    | Set G10 to 0 if ZF == 1
AND G6, RDX, G10 (SSZ: 64; DSZ: 64)    | G6 ← 0x7788                     | Store (RDX & G10) in internal register G6
SHR G7, G2, G3 (SSZ: 64; DSZ: 64)      | G7 ← 0x0000_0123_4567_89AB      | Store (G2 >> G3) in G7
SHL G2, G7, G3 (SSZ: 64; DSZ: 64)      | G2 ← 0x0123_4567_89AB_0000      | Store (G7 << G3) in G2
OR G2, G2, G6 (SSZ: 64; DSZ: 64)       | G2 ← 0x0123_4567_89AB_7788      | Store (G2 | G6) in G2
ROR G7, G2, G3 (SSZ: 64; DSZ: 64)      | G7 ← 0x7788_0123_4567_89AB      | Rotate G2 right by G3 and store result in G7
OR RCX, G7, G7 (SSZ: 64; DSZ: 64)      | RCX ← 0x7788_0123_4567_89AB     | Store the result of G7 | G7 in RCX


maintaining a formal specification for it would be impractical. The readability 
and complexity of such a specification would be similar to that of the implementation
itself. How, then, do we go about its verification? We have one methodology
to verify the decoding of byte sequences into legal/illegal instructions (with
appropriate exceptions), and another one to show that legal instructions are
implemented correctly via microoperations.


Balancing Automation and Control for FV of Microprocessors 31 


Listing 1.1. SHRD entry in inst.lst


(xINST "SHRD"
       (OP :OP #xFAC)
       (ARG :OP1 '(:MODR/M.R/M :GPR :MEM)
            :OP2 '(:MODR/M.REG :GPR)
            :OP3 '(:IMM8))
       '(X86-SHLD/SHRD)
       '((:UD (UD-LOCK-USED))))


For illegal instructions, we make sure that all sequences of bytes that do 
not decode into a sequence of legal instructions are recognized as illegal and 
we verify that an appropriate exception is signaled. This is done by simulating 
the front-end on a symbolic sequence of bytes and proving that any input that 
does not map to a legal opcode (as defined by the decode specification in our 
x86 model) produces an exception. The decode specification in the x86 model 
relies heavily on inst.lst, a data structure defined by us that captures all
the information needed to decode every x86 instruction. The initial version of
inst.lst was mechanically extracted from the Intel manuals (Chaps. 3-5, Vol.
2) [27] by parsing the tables in the description pages of each instruction and
transforming the contents into an ACL2-readable format. For instance, for the
implementation in Table 2, the relevant entry in the Intel manuals is in Table 1
and that in inst.lst is in Listing 1.1. Since then, inst.lst has been inspected,
enhanced, and validated against internal and external x86 decoders. 
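
The shape of the illegal-instruction check can be sketched in Python under strong simplifying assumptions: a toy two-byte opcode space, a hypothetical two-entry table in place of the real inst.lst data, and exhaustive enumeration in place of symbolic simulation. None of the names below come from the actual decoder or from the x86 model.

```python
# Hypothetical opcode table in the spirit of inst.lst (not the real data).
LEGAL_OPCODES = {
    (0x0F, 0xAC): "SHRD",
    (0x0F, 0xA4): "SHLD",
}


class UndefinedOpcode(Exception):
    """Stands in for the #UD exception raised on illegal byte sequences."""


def spec_decode(b0, b1):
    """Decode specification: a plain table lookup."""
    try:
        return LEGAL_OPCODES[(b0, b1)]
    except KeyError:
        raise UndefinedOpcode((b0, b1))


def frontend_decode(b0, b1):
    """Toy 'implementation', written independently of the table."""
    if b0 == 0x0F and (b1 & 0xF7) == 0xA4:   # matches 0xA4 and 0xAC only
        return "SHRD" if b1 & 0x08 else "SHLD"
    raise UndefinedOpcode((b0, b1))


def outcome(decoder, b0, b1):
    """Normalize a decoder result so legal and illegal cases compare equal."""
    try:
        return decoder(b0, b1)
    except UndefinedOpcode:
        return "#UD"


def mismatches():
    """Every byte pair on which the implementation and the spec disagree."""
    return [(b0, b1)
            for b0 in range(256) for b1 in range(256)
            if outcome(spec_decode, b0, b1) != outcome(frontend_decode, b0, b1)]
```

An empty result from `mismatches()` says two things at once: legal byte pairs decode to the expected instruction, and every other byte pair raises the exception. That is the property the text describes for the real front-end, there established over symbolic byte sequences rather than by enumeration.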

Next we focus on our process for verifying legal instructions. For each instruc- 
tion, our goal is to prove that for any starting machine state and for any 
byte sequence representing a legal invocation of that instruction in that state, 
the front-end produces a sequence of microoperations which, when run on our 
microcode model, produce the same results as the instruction run on our x86 
model². To prove this, we simulate the front-end to generate the corresponding
sequence of microoperations. Using FGL, we then prove that the sequence
implements the instruction as defined by our x86 specification. FGL symboli- 
cally processes the sequence of microoperations as executed on our microcode 
model, resulting in a symbolic machine state where the bits of the written reg- 
isters are represented as Boolean formulas in terms of the values read from the 
initial state. It likewise processes the instruction specification, reducing it to 
Boolean formulas as well. We can then show by Boolean equivalence checking 
that the front-end-generated sequence of microoperations has the same effect on 
the state as the x86 instruction specification. We discuss the process of symbolic 
simulation of microcode by FGL in Sect. 4. This FGL proof confirms that the
front-end’s operation is correct for this particular instruction. 
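
As an illustration of the proof obligation, the following Python sketch runs the microoperation sequence of Table 2 on plain integers and compares the written register with a direct definition of SHRD. It is only a sketch: flags are ignored, the micro-op semantics are inferred from the descriptions in Table 2, and random testing stands in for the symbolic FGL proof.

```python
import random

MASK64 = (1 << 64) - 1


def shrd_spec(dst, src, imm8):
    """Architectural result of SHRD r/m64, r64, imm8 (flags omitted)."""
    count = imm8 & 63
    if count == 0:
        return dst
    return ((dst >> count) | (src << (64 - count))) & MASK64


def ror64(value, count):
    """Rotate a 64-bit value right by count (taken modulo 64)."""
    count &= 63
    return ((value >> count) | (value << (64 - count))) & MASK64


def shrd_microcode(rcx, rdx, imm8):
    """Micro-op sequence of Table 2, one statement per micro-op."""
    g2 = rcx                          # MOVSX G2, RCX
    g3 = imm8 & 0xFF                  # MOVZX G3, <imm8>
    g3 &= 63                          # AND G3, G3, 63
    g10 = MASK64                      # MOV G10, -1
    if g3 == 0:                       # JE G3, 0, ent_nop
        return rcx
    g5 = (0 - g3) & 0xFFFFFFFF        # SUB G5, 0, G3 (32-bit result)
    zf = int(g5 == 0)                 # zero flag set by SUB
    if zf == 0:                       # conditional SHR, taken if ZF == 0
        g10 >>= g5 & 63
    if zf == 1:                       # conditional AND, taken if ZF == 1
        g10 = 0
    g6 = rdx & g10                    # AND G6, RDX, G10
    g7 = g2 >> g3                     # SHR G7, G2, G3
    g2 = (g7 << g3) & MASK64          # SHL G2, G7, G3
    g2 |= g6                          # OR G2, G2, G6
    g7 = ror64(g2, g3)                # ROR G7, G2, G3
    return g7 | g7                    # OR RCX, G7, G7


def check(trials=2000, seed=0):
    """Return a counterexample (rcx, rdx, imm8), or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        rcx, rdx = rng.getrandbits(64), rng.getrandbits(64)
        imm8 = rng.randrange(256)
        if shrd_microcode(rcx, rdx, imm8) != shrd_spec(rcx, rdx, imm8):
            return rcx, rdx, imm8
    return None
```

With the initial values of Table 2, `shrd_microcode` reproduces the expected RCX value. Unlike the FGL proof, `check` only samples the input space, so a `None` result is evidence rather than a guarantee.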

This correctness has two caveats. First, it assumes that the individual micro- 
operations are correctly implemented, i.e., in accordance with their specifica- 
tions in our microcode model. Second, in the case of out-of-order processors, 
if the microoperations are executed in a different order, that sequence needs 
to be compared to the sequence generated by the front-end. Currently, we can 


2 Note that the microcode model is a proprietary formal model of the microarchitecture 
implemented by the design. Its validation is discussed later. 

ensure only the former: for most microoperations, we have proved that their
implementations in the processor’s execution units match their behavior in
our microcode model; we discuss this further in Sect. 3.2. However, the latter— 
the correctness of reordering of the microoperations—is work planned for the 
future. 

There is another part to the verification story. The front-end generates only 
sequences of microoperations of a limited length. Some instructions are complex 
and require much longer sequences (e.g. instructions performing transcendental 
or cryptographic functions). For these instructions, the sequence of microopera- 
tions generated by the front-end is just the beginning of the microcode program. 
The rest is stored in a ROM and the front-end generates the entry-point of this 
code. That means our verification has to account for those microcode routines. 

ROM instructions are more complex than microoperations and they may 
also be compressed in order to save valuable ROM space. As in the front-end, 
the specification of this compression and decoding of ROM instructions into 
sequences of microoperations is complex and also changes during the design 
process. Even if we could define it formally, the maintenance of such a specification
would be very time consuming. Instead, we use the same trick as with the
front-end: we symbolically simulate the part of the design that fetches ROM
instructions and translates them into a sequence of microoperations. The rest
is handled in the same way as for the sequence of microoperations generated by the front-end.
These proofs implicitly verify the correctness of fetching from ROM and ROM 
instruction translation. We call this implicit verification because we do not have 
an explicit specification of the translation and fetching. However, we do have 
formal specifications of the instructions implemented by the microcode. There- 
fore, the proof of correctness of the instructions implies the correctness of the 
underlying design, including ROM fetching and translation. In other words, we 
can verify some parts of the design as black-boxes, without knowing exactly 
how they work, by reasoning about the overall observable effect on the machine 
state. The main advantage of this type of verification of both the front-end and 
microcode translator is that the maintenance of the proofs does not require 
either deep understanding of the design or writing and maintaining cumbersome 
specifications. 

The microcode sequences generated from x86 instructions that we encoun- 
tered so far were in the style of straight-line code. We do not expect this to be 
the case for all of them. In the past, we worked on some microcode stored in 
ROM that served other purposes [18]. This code had loops and jumps between 
loops and we were able to do invariant-style proofs. Our main problem at that 
time was that the proofs were not robust enough and very hard to maintain. Now 
we are in a much better position, having FGL and a methodology that keeps 
the microcode model in sync with the design. Hence, we are optimistic about 
our ability to bring the verification of most, if not all, legal x86 instructions to 
completion. 


Finally, we note that in our previous work [21], we used GL—the predecessor 
of FGL—as our core verification tool. The benefits of switching to FGL have 
been considerable. GL had limited support for term rewriting, as a result of 
which symbolic simulation of the microcode model was difficult and debugging 
failed proof attempts even more so. As such, instead of programming GL to 
deal with symbolic machine states, we usually used ACL2’s rewriter to “open 
up” the microcode model and played to GL’s strengths by using it for the final 
equivalence proof that often required non-trivial arithmetic reasoning. In other 
words, we obtained ACL2 formulas corresponding to the written registers in 
terms of the values in the initial microcode state, and then used GL to prove 
that those formulas were equivalent to our specification functions. FGL easily 
allows us to do these tasks (symbolic simulation and equivalence checking) along 
with others within a common environment and thereby reduces overhead in our 
methodology. 


3.2 Verification of Execution Units 


Everything that was said in Sect. 3.1 relies on the assumption that our microcode 
model is correct. Parts of that model—front-end decoding and ROM instruction 
fetch and translate—are implicitly verified. The other part, definitions of micro- 
operations that form the base of the model, were explicitly defined in ACL2 and 
need to be validated. A large portion of our work lies in the proofs that confirm 
that RTL executes the microoperations in compliance with those specifications. 
In order to achieve that, we build a formal model of the respective RTL module 
[7], unroll it with respect to the latencies of the microoperations to be verified 
[6] and check conformance with the specification using FGL. 
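
The unroll-and-compare step can be pictured with a small Python sketch. The two-stage adder, its latency, and all names are assumptions made for illustration; the real flow builds the model from RTL and checks conformance symbolically with FGL rather than by enumeration.

```python
from itertools import product

WIDTH = 3
MASK = (1 << WIDTH) - 1
LATENCY = 2  # cycles from issuing the operands to the result register


def step(state, a, b):
    """One clock cycle of a toy two-stage adder 'RTL' model.

    state = (stage1_operands, result_register)
    """
    stage1, _result = state
    return ((a, b), (stage1[0] + stage1[1]) & MASK)


def unroll(state, inputs):
    """Apply `step` once per entry of `inputs` and return the final state."""
    for a, b in inputs:
        state = step(state, a, b)
    return state


def spec(a, b):
    """Microoperation specification: addition modulo 2**WIDTH."""
    return (a + b) & MASK


def conforms():
    """Check the unrolled model against the spec for all operands.

    Operands issue in cycle 0. The inputs of cycle 1 and the initial
    state are varied to show that the result does not depend on them.
    """
    values = range(1 << WIDTH)
    initial_states = (((0, 0), 0), ((MASK, MASK), MASK))
    for a, b, x, y in product(values, repeat=4):
        for init in initial_states:
            _, result = unroll(init, [(a, b), (x, y)])
            if result != spec(a, b):
                return False
    return True
```

Unrolling to the maximum latency once and sharing the result, as described for the EXE module, corresponds here to computing `unroll` for `LATENCY` cycles and reusing it for every microoperation that completes within that window.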

These microoperations are executed in various units, the number, timing, and 
organization of which differs based on the specific microarchitecture. We might 
have separate floating-point add and floating-point multiply units, or one unit 
that executes both. There might be a unit that implements string operations, 
another that implements integer operations, and yet another one devoted to 
SIMD operations, etc. The scope of proofs that confirm correctness of execution 
of microoperations is dictated by the capacity of the tools we use. During the 
first years of FV at Centaur, we limited the proof for each microoperation to 
the specific unit where it was executed [25,26,37]. Since then, improvements 
to our RTL modeling and symbolic simulation (i.e., FGL) allow us to do the 
proofs in the scope of the module containing all those units (we refer to that 
module as the execution module or EXE) [21]. Migration to a higher scope has 
a huge advantage for the stability of the proofs. First, proofs are robust with 
respect to the changes of the interfaces of submodules in EXE. For instance, when 
an interface of a floating-point sub-unit changes to accommodate extra control 
signals that simplify its logic, very likely the change is transparent to the
input-output behavior of EXE and will not affect our proofs. Second, if timing of an
internal unit changes, but overall timing of the EXE module does not, that is 
transparent to the proofs. 

Having all microoperation proofs in the same scope has another advantage—
we can build just one formal model of the RTL, do one unrolling to the maximum 
latency, and store it as a constant that can be shared and loaded by individual 
proofs. A review of the assumptions about the interface and the maintenance of 
the assumptions is also simplified when all the proofs are done with respect to 
one module. 


3.3 Regressions 


Regressions have become an indispensable part of continuous integration.
There are several reasons why we need to re-run our proofs regularly. Since 
we start to build our proofs early in the design process, design changes occur 
regularly and can introduce bugs that we need to catch. But proofs can be 
broken not only due to changes in the design but also because of changes in 
the specifications, tools, and libraries. While the ISA specification is relatively 
stable, the microarchitecture specification might change during the project as 
a result of feedback from back-end tools or better ideas from the designers. 
Proofs might also change as the design becomes more mature and we add more 
thorough checks. While the core ACL2 theorem prover is very stable, ACL2 
libraries are growing and may be modified by developers outside our team. All 
of these verification artifacts are tightly interconnected and regressions ensure 
that we keep them consistent. 

When a proof of the correctness of a microoperation fails, there are several 
possible reasons: 


— There is a bug in the RTL design. 

— There was a change in the design (interface or timing) that our proofs need 
to take into account. 

— The specification of the microoperation changed; e.g. some flags indicate a 
new intended use, or a portion of the result became “don’t care”. 


We need to investigate the reason for failure and either report a bug to 
designers, adjust the proofs, or change the specification of the microoperation. 

When we change the specification of a microoperation, the new definition 
will then be used by our microcode proofs. If those fail, it may indicate that the 
change affected some instruction implementations in an undesirable way. In other 
cases, the failure might be a result of missing rewrite rules. Microcode proofs
might also fail because a change to the front-end design, or to the logic that
fetches and translates ROM instructions, introduced a new bug.

Regressions can be scheduled for a specific frequency (daily, weekly, etc.), 
run manually, or triggered by changes in the design, specification, or tool suite. 
We use open-source tools like git and Jenkins, and ACL2-specific scripts that 
compute dependencies on ACL2 files. Regressions also automatically generate 
a documentation manual from our ACL2 proof scripts [17]. This documentation
includes information about which proofs failed and which succeeded and,
as a result, which microoperations and instructions are covered by
successful proofs. This keeps the documentation in sync with the design as well
as the proofs. We tag individual documentation topics to indicate their intended 
audience; e.g., the General Audience tag is used when an overview of a verifica- 
tion effort is presented, and the FV Audience tag is used when describing proof 
strategies and verification tools. 
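
The dependency computation behind change-triggered regressions can be sketched as a reachability query over a file dependency graph. The Python below is a minimal illustration: the file names and the `DEPENDS_ON` map are hypothetical and are not taken from the actual regression scripts.

```python
from collections import deque

# Hypothetical dependency map: each file lists the files it includes.
DEPENDS_ON = {
    "uop-specs.lisp": ["bitops.lisp"],
    "exe-model.lisp": ["rtl/exe.sv"],
    "uop-proofs.lisp": ["uop-specs.lisp", "exe-model.lisp"],
    "ucode-model.lisp": ["uop-specs.lisp"],
    "x86-model.lisp": ["bitops.lisp"],
    "shrd-proof.lisp": ["ucode-model.lisp", "x86-model.lisp"],
}


def affected(changed):
    """Return, sorted, every file that transitively depends on a changed file."""
    # Invert the map: source file -> files that include it directly.
    dependents = {}
    for target, sources in DEPENDS_ON.items():
        for source in sources:
            dependents.setdefault(source, set()).add(target)

    # Breadth-first search from the changed files through the dependents.
    seen, queue = set(), deque(changed)
    while queue:
        for target in dependents.get(queue.popleft(), ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(seen)
```

In this toy graph, a change to the RTL re-runs only the execution-unit proofs, while a change to a microoperation specification also re-runs the microcode proof that is built on top of it, which mirrors the interconnection of artifacts described above.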


4 FGL 


Since FGL is the core proof engine used in our microcode and execution unit 
proofs, we will describe here how it works and how it may be programmed. 

FGL [4,39] is part of the ACL2 libraries and publicly available [9]. It is a 
significant rewrite and extension of GL (“G in the Logic”) [38,41,42], which was 
itself a rewrite and extension of the G System of Boyer and Hunt [13]. The idea 
behind all of these is to recursively transform ACL2 terms into symbolic objects 
that represent the values of these terms and that consist mostly of structures 
containing Boolean function objects. When successful, the result of transforming 
the body of a conjecture is a single Boolean function, which may be checked 
for validity. The G System supported Boolean functions represented as binary 
decision diagrams [15], and operated on symbolic input objects using symbolic 
counterpart functions derived mechanically from function definitions. GL used 
an interpreter to capture function behavior rather than translating definitions, 
and added support for an and-inverter graph (AIG) representation for Boolean 
functions along with links to external SAT solvers for resolving Boolean function 
validity. Later changes in GL added preliminary support for rewrite rules and 
termlike symbolic objects so as to allow for some abstraction. 

FGL continues the trend toward user-definable rules displacing built-in 
behavior. It is a rewriter at its core, so user-defined rewrite rules are the basis 
of its reasoning system, rather than an add-on. Nevertheless, it comes with an 
extensive library of rules that replicates the automation provided by GL. Rewrite 
rules supported by FGL offer powerful capabilities such as programmable binding 
of free variables and visibility into the syntax of the rewriting targets [39]. FGL 
also replaces built-in primitive function symbolic counterparts with meta rules 
similar in spirit to ACL2’s [23], which similarly allow directly programmable 
manipulation of the syntax of objects but may also be added by users. FGL 
adds support for incremental SAT, allowing multiple SAT checks of related for- 
mulas to share learned clauses and heuristic information. It also allows global 
simplification of the entire AIG using combinational circuit simplification meth- 
ods. Both of these features may be invoked from within rewrite rules; e.g., if 
the author of a rewrite rule judges that a hypothesis of the rule is unlikely to 
be solved by rewriting alone, they may specify that incremental SAT should be 
used to prove it. 
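
For readers unfamiliar with and-inverter graphs, the following Python sketch shows the basic data structure: literals that encode a node index plus a negation bit, AND nodes shared through structural hashing, and a validity check on a miter. It is an illustration only and does not reflect FGL's actual AIG library; in particular, the brute-force `satisfiable` function is a stand-in for a call to a SAT solver.

```python
from itertools import product


class AIG:
    """And-inverter graph; a literal is 2 * node_id + negation bit."""

    FALSE, TRUE = 0, 1

    def __init__(self):
        self.nodes = [None]   # node 0 is the constant false
        self.table = {}       # structural hashing of AND nodes

    def new_input(self):
        self.nodes.append("input")
        return 2 * (len(self.nodes) - 1)

    @staticmethod
    def neg(lit):
        return lit ^ 1

    def and_(self, a, b):
        if a > b:
            a, b = b, a
        if a == self.FALSE or a == self.neg(b):
            return self.FALSE
        if a == self.TRUE or a == b:
            return b
        key = (a, b)
        if key not in self.table:          # reuse an existing node if any
            self.nodes.append(key)
            self.table[key] = 2 * (len(self.nodes) - 1)
        return self.table[key]

    def or_(self, a, b):
        return self.neg(self.and_(self.neg(a), self.neg(b)))

    def xor(self, a, b):
        return self.or_(self.and_(a, self.neg(b)), self.and_(self.neg(a), b))

    def evaluate(self, lit, env):
        """Evaluate `lit` under env, a map from input literal to bool."""
        node = self.nodes[lit >> 1]
        if node is None:
            value = False
        elif node == "input":
            value = env[lit & ~1]
        else:
            value = self.evaluate(node[0], env) and self.evaluate(node[1], env)
        return value != bool(lit & 1)


def satisfiable(aig, lit, inputs):
    """Brute-force stand-in for a SAT solver call on `lit`."""
    return any(aig.evaluate(lit, dict(zip(inputs, values)))
               for values in product((False, True), repeat=len(inputs)))
```

Two formulas `f` and `g` over the same inputs are equivalent exactly when their miter `aig.xor(f, g)` is unsatisfiable. Keeping many such formulas in one shared graph is also what makes it natural for related SAT checks to reuse work, as the incremental SAT support described above does.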

Many other projects have also aimed to allow interactive theorem provers 
to call on automatic decision procedures; too many such efforts exist to list 
them all. In higher-order logic proof assistants, several tools collectively called 
hammers translate queries into the language of an automated theorem prover 


Listing 1.2. Semantics of a machine instruction


(defun run-inst (inst st)
  (let* ((instname (first inst))
         (args (rest inst))
         (x (first args))
         (y (second args))
         (ans (case instname
                (const y)
                (copy (get-st-reg y st))
                (add (+ (get-st-reg x st)
                        (get-st-reg y st)))
                (and (bitwise-and (get-st-reg x st)
                                  (get-st-reg y st)))
                (rshift (right-shift (get-st-reg y st)
                                     (get-st-reg x st))))))
    (set-st-reg x ans st)))


Listing 1.3. Semantics of a straight-line code block


(defun run-prog (insts st)
  (if (atom insts)
      st
    (let ((st (run-inst (first insts) st)))
      (run-prog (rest insts) st))))


and then translate the emitted proof back into a form acceptable by the original 
prover [12]. Several decision procedure integrations have also been carried out 
in ACL2. Reeber and Hunt [33] identified a decidable subclass of ACL2 list 
formulas and contributed a decision procedure that transforms such a formula 
into a SAT problem. Peng and Greenstreet [31] process a subclass of ACL2 
formulas including integer and rational arithmetic, uninterpreted functions, and 
algebraic data structures, converting such problems to SMT queries. FGL differs 
by focusing on the efficient integration of user-extendible term rewriting and 
Boolean simplification and decision procedures. 


4.1 Example 


We describe how FGL works at a high level by running through an example, 
the code of which is publicly available [40]. We define a simple machine model 
(Listings 1.2, 1.3) that has 16 32-bit registers and a few instructions defined, 
and use those instructions to implement (in straight-line code) an optimized 
routine to count the number of bits set in a 32-bit input (Listing 1.4), similar to 
implementations in Bit Twiddling Hacks [11]. We also define a straightforward 
ACL2 specification count-bits for the bit count operation (Listing 1.5). We 
prove that for any initial state, if we run this program on the machine, then the 
resulting state has its register 0 value equal to the count-bits of the value that 
was in register 0 before running the program (Listing 1.6). 
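
The example can be mirrored in Python to make the statement of the theorem concrete. The interpreter below follows the instruction semantics of Listing 1.2, with the assumption that register writes are truncated to 32 bits and register indices to 4 bits. The program is a complete bit-count routine written for this sketch in the style of Bit Twiddling Hacks; it is not a transcription of the program in Listing 1.4. Random testing stands in for the FGL proof.

```python
import random

MASK32 = 0xFFFFFFFF


def run_inst(inst, regs):
    """Execute one (name, x, y) instruction on a list of 16 registers."""
    name, x, y = inst
    if name == "const":
        ans = y
    elif name == "copy":
        ans = regs[y & 0xF]
    elif name == "add":
        ans = regs[x & 0xF] + regs[y & 0xF]
    elif name == "and":
        ans = regs[x & 0xF] & regs[y & 0xF]
    elif name == "rshift":
        ans = regs[x & 0xF] >> regs[y & 0xF]   # shift reg x right by reg y
    else:
        raise ValueError(f"unknown instruction {name!r}")  # sketch-only choice
    new_regs = list(regs)
    new_regs[x & 0xF] = ans & MASK32           # assumed 32-bit truncation
    return new_regs


def run_prog(insts, regs):
    """Execute a straight-line program."""
    for inst in insts:
        regs = run_inst(inst, regs)
    return regs


def bitcount_program():
    """Parallel bit count of register 0, leaving the result in register 0."""
    prog = [("copy", 10, 0)]
    for shift, mask in ((1, 0x55555555), (2, 0x33333333), (4, 0x0F0F0F0F),
                        (8, 0x00FF00FF), (16, 0x0000FFFF)):
        prog += [("copy", 11, 10),
                 ("const", 5, mask), ("and", 10, 5),
                 ("const", 0, shift), ("rshift", 11, 0), ("and", 11, 5),
                 ("add", 10, 11)]
    prog += [("const", 0, 0x003F), ("and", 10, 0), ("copy", 0, 10)]
    return prog


def count_bits(x):
    """Specification: number of set bits of a positive integer."""
    if not isinstance(x, int) or x <= 0:
        return 0
    return (x & 1) + count_bits(x >> 1)


def check(trials=1000, seed=0):
    """Return an input on which program and specification differ, or None."""
    rng = random.Random(seed)
    prog = bitcount_program()
    for _ in range(trials):
        regs = [rng.getrandbits(32) for _ in range(16)]
        if run_prog(prog, regs)[0] != count_bits(regs[0]):
            return regs[0]
    return None
```

The function `check` samples initial states, whereas the theorem quantifies over all of them; closing that gap without enumerating 2^32 inputs is what the symbolic treatment by FGL provides.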

The invocation of def-fgl-thm in Listing 1.6 causes the FGL rewriter to be 
applied to the conjecture. It begins by descending into the term and applying 
rewrite rules to subterms from the inside out. In many cases, these rules are just 
the definitional formulas of the functions we have introduced; for example, the 
definitions of run-prog, run-inst, and count-bits are used as rewrite rules, so 
that calls of these functions are replaced by their bodies. Rewriting the term 


Listing 1.4. BITCOUNT program listing


(defconst *bitcount*
  '((copy 10 0)           ;; copy the operand to regs 10 and 11
    (copy 11 0)
    (const 5 #x55555555)  ;; set reg 5 to the mask
    (and 10 5)            ;; bitand the operand with the mask
    (const 0 1)           ;; set reg 0 to 1
    (rshift 11 0)         ;; right shift the operand by 1
    (and 11 5)            ;; mask the shifted operand


    (const 0 #x003f)
    (and 10 0)            ;; mask the relevant bits of the result
    (copy 0 10)))         ;; move the result to reg 0.


Listing 1.5. count-bits specification function


(defun count-bits (x)
  (if (or (not (integerp x)) (<= x 0))
      0
    (+ (nth-bit 0 x)
       (count-bits (right-shift 1 x)))))


while opening such definitions effectively conducts a symbolic simulation of the 
program and its specification. For some functions, it is preferable to avoid open- 
ing the definitions and instead use rules that rely on particular properties to 
simplify combinations of calls; for example, Listing 1.7 shows a rule that simplifies
a read of a write of the machine state’s register file.³

Rather than producing a new term as the result of rewriting each subterm, 
the FGL rewriter produces hybrid structures we call symbolic objects that may 
(like terms) contain function calls, variable references, and constants, but (unlike 
terms) also may contain symbolic Booleans, represented by a reference into an 
AIG defining a Boolean function, and symbolic integers, represented by a list 
of references into the AIG giving the two’s-complement bits. Table 3 lists the 
variants of symbolic objects. 

In order to prove this conjecture, we aim for the result of rewriting the con- 
jecture to be a symbolic Boolean, which can then be proved valid by encoding 
its negation as a SAT problem. We therefore want to compute a Boolean for- 
mula equivalent to the equal comparison of the specification and implementation 
results. Working backwards from this goal, we can obtain this if we can repre- 
sent the specification and implementation results as symbolic integers; the equal 


Listing 1.6. Correctness theorem for BITCOUNT 


(def-fgl-thm bitcount-implements-count-bits
  (let* ((input (get-st-reg 0 st))
         (final-st (run-prog *bitcount* st))
         (result (get-st-reg 0 final-st)))
    (equal result (count-bits input))))


3 Since ACL2 is an untyped language, functions have well-defined behavior even on ill- 
typed inputs. The uses of zero-extend in this rule reflect the choice of the definitions 
to coerce integers that don’t fit in the allotted space into well-typed values by zero- 
extending them. 
Listing 1.7. Read-over-write rule for get-st-reg


(def-fgl-rewrite get-st-reg-of-set-st-reg
  (equal (get-st-reg i (set-st-reg j v st))
         (if (equal (zero-extend 4 i) (zero-extend 4 j))
             (zero-extend 32 v)
           (get-st-reg i st))))
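The content of the read-over-write rule can be checked against a small executable model. The sketch below is not the paper's machine model: we assume a register file indexed by 4-bit register numbers holding 32-bit values, represented as a Python dictionary, with ill-typed indices and values coerced by zero-extension as the footnote describes. The Python names are ours.

```python
REG_INDEX_BITS = 4   # assumption: register numbers are 4 bits wide
WORD_BITS = 32       # assumption: registers hold 32-bit values


def zero_extend(width: int, x: int) -> int:
    """Keep the low `width` bits of x (two's-complement view of Python ints)."""
    return x & ((1 << width) - 1)


def set_st_reg(j: int, v: int, st: dict) -> dict:
    """Return a copy of st with register j set to v, coercing both arguments."""
    new_st = dict(st)
    new_st[zero_extend(REG_INDEX_BITS, j)] = zero_extend(WORD_BITS, v)
    return new_st


def get_st_reg(i: int, st: dict) -> int:
    """Read register i; registers never written read as 0 in this model."""
    return st.get(zero_extend(REG_INDEX_BITS, i), 0)


def read_over_write_rhs(i: int, j: int, v: int, st: dict) -> int:
    """Right-hand side of the rewrite rule, transcribed directly."""
    if zero_extend(REG_INDEX_BITS, i) == zero_extend(REG_INDEX_BITS, j):
        return zero_extend(WORD_BITS, v)
    return get_st_reg(i, st)
```

In this model, `get_st_reg(i, set_st_reg(j, v, st))` agrees with `read_over_write_rhs(i, j, v, st)` even for out-of-range `i`, `j`, and `v`, which is why the rule needs no type hypotheses.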


Table 3. Symbolic object variants 


— (g-boolean lit) represents a Boolean, t or nil, as an AIG literal, lit 

— (g-integer lit lit1 ...) represents an integer as a list of AIG literals giving the two’s-complement bits, least-significant first 
— (g-concrete obj) represents the constant obj itself 

— (g-apply fn args) represents a function application, where fn is a function symbol and args is a list of symbolic objects 

— (g-var name) represents a variable named name 

— (g-ite test then else) represents an if-then-else, where the three arguments are symbolic objects 

— (g-cons car cdr) represents a cons pair, where the two arguments are symbolic objects 

— (g-map tag alist) represents a table of key/value pairs with constant keys and symbolic values, supporting fast lookups (see ACL2 documentation on fast alists [3])


comparison of these is the conjunction of the Boolean equivalences between all 
the corresponding bits. Working further backwards, we’ll find that we can simi- 
larly compute these values given the bits of the intermediate integer values from 
which they are computed, etc., back to the original values that are components 
of the free variables of the conjecture. That is, generally speaking, we wish to 
represent every intermediate integer value as a symbolic integer. In the next 
two sections we will describe how to extract Boolean variables from the initial 
variables of the conjecture (Sect. 4.2) and how to build up Boolean formulas to 
represent the bits of intermediate values (Sect. 4.3). 
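The reduction of an integer equality to a conjunction of bit equivalences can be sketched as follows. This is a toy model, not FGL: Boolean formulas are represented as Python closures over an assignment rather than AIG nodes, and validity is checked by brute-force enumeration where FGL would hand the negated formula to a SAT solver. All names are ours.

```python
from itertools import product


def var(name):
    """A Boolean variable, evaluated by lookup in an assignment dict."""
    return lambda env: env[name]


def const(b):
    return lambda env: b


def bit_and(f, g):
    return lambda env: f(env) and g(env)


def bit_iff(f, g):
    return lambda env: f(env) == g(env)


def sym_equal(xs, ys):
    """Equality of two equal-width symbolic integers (LSB-first bit lists):
    the conjunction of the equivalences between corresponding bits."""
    assert len(xs) == len(ys)
    result = const(True)
    for f, g in zip(xs, ys):
        result = bit_and(result, bit_iff(f, g))
    return result


def is_valid(formula, names):
    """Brute-force validity check over all assignments to `names`."""
    return all(formula(dict(zip(names, vals)))
               for vals in product([False, True], repeat=len(names)))
```

For instance, with `x = [var("x0"), var("x1")]`, the formula `sym_equal(x, [bit_and(b, b) for b in x])` is valid, whereas `sym_equal(x, [const(False), var("x1")])` is not.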


4.2 Extracting Boolean Variables 


When rewriting a term in a Boolean context such as the test of an if expression, 
FGL will coerce the rewritten result to a symbolic Boolean object. The symbolic 
Boolean values of symbolic object types other than function calls and variables 
are easy to determine; for example, integers are non-nil and therefore considered 
true in ACL2. For function call and variable results, this coercion is accomplished 
by assigning a Boolean variable to the object, either a fresh one (a new primary
input node in the underlying AIG) or an existing one when such an assignment
has already been recorded for that object. These Boolean variables along with 
the constants t and nil are the base Boolean formulas. More complex formulas 
are built up from these variables by processing of if terms and by low-level 
meta-routines, introduced below. 

The Boolean variables needed for the bitcount proof correspond to the bits 
of the accessed registers of the initial machine state st. We introduce rewrite 
rules that cause FGL to generate 32 Boolean variables for the bits of a 32-bit 
register when that register is accessed, composing these into a symbolic integer. 
The two rules involved are shown in Listing 1.8. 
Listing 1.8. Rules for generating Boolean variables for initial register values


(def-fgl-rewrite get-st-reg-generate-bits
  (implies (syntaxp (fgl-object-case st :g-var))
           (equal (get-st-reg n st)
                  (zero-extend 32 (hide-get-st-reg n st)))))

(def-fgl-rewrite zero-extend-const-width
  (implies (syntaxp (integerp n))
           (equal (zero-extend n x)
                  (if (or (not (integerp n))
                          (<= n 0))
                      0
                    (intcons (intcar x)
                             (zero-extend (1- n) (intcdr x)))))))


The FGL rewriter will try to apply the first rule, get-st-reg-generate-bits, 
every time it encounters a call of get-st-reg, but due to its syntaxp hypoth- 
esis it will immediately fail if st is not syntactically a variable. In the case of 
the conjecture we’re attempting to prove, this ensures that the rule will only 
apply to get-st-reg calls on the initial state. Such calls will be replaced by 
the zero-extend term of the right-hand side. In that term, hide-get-st-reg is 
an alias for get-st-reg; this avoids looping in the application of the rule. The 
construction of the 32-bit vector of Boolean variables is then accomplished by 
repeated application of the rule zero-extend-const-width. The functions intcar,
intcdr, and intcons are used here to access or construct bits of an integer as if it
were a list of Booleans: intcar gets the Boolean value of the least-significant bit 
(LSB), intcdr right-shifts by 1 to remove the LSB, and intcons adds a new LSB 
to an integer, reversing the intcdr operation. The first argument to intcons is 
recognized by FGL as a Boolean context, so the rewriter will introduce Boolean 
variables corresponding to the terms that appear there, namely: 


(intcar (intcdr...(intcdr (hide-get-st-reg k st))...))


The association of each such termlike object with the corresponding Boolean 
variable is stored in a hash table. Each time a termlike object is found in a 
Boolean context, it is looked up in the table; if it has an existing entry, the 
corresponding Boolean variable is returned, and if not, a new Boolean variable 
is generated and stored. 

After generating the new Boolean variable, the intcons call becomes a new 
symbolic integer that now includes that bit. The final value produced by the zero- 
extend is therefore a symbolic integer consisting of 32 fresh Boolean variables. If 
the same register were to be accessed again, the same process would occur except 
that the objects associated with the Boolean variables would be recognized and 
the same Boolean variables returned again. 
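The bookkeeping described above amounts to memoized variable allocation. The sketch below is our own simplification: term-like objects are modeled as nested Python tuples, Boolean variables as consecutive integers, and the hash table as a dictionary. The class and function names are hypothetical and do not correspond to FGL internals.

```python
class BoolVarTable:
    """Maps term-like objects to Boolean variable indices, allocating lazily."""

    def __init__(self):
        self._table = {}

    def var_for(self, term):
        # Reuse the recorded variable if present; otherwise allocate a fresh one.
        if term not in self._table:
            self._table[term] = len(self._table)
        return self._table[term]


def bit_term(reg, k):
    """Tuple standing for (intcar (intcdr ... (hide-get-st-reg reg st)))
    with k nested intcdr calls, i.e. bit k of the register."""
    term = ("hide-get-st-reg", reg, "st")
    for _ in range(k):
        term = ("intcdr", term)
    return ("intcar", term)


def symbolic_register(table, reg, width=32):
    """Symbolic integer for a register: one Boolean variable per bit, LSB first."""
    return [table.var_for(bit_term(reg, k)) for k in range(width)]
```

Calling `symbolic_register` twice for the same register returns the same 32 variables, while a different register receives 32 fresh ones.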


4.3 Composing Boolean Functions 


The most basic way in which a new Boolean formula is computed from a pre- 
vious one during FGL’s rewriting process is by FGL’s built-in handling of if. 
Specifically, if an if term occurs in which the two branches are both symbolic 




Listing 1.9. Bitwise AND implementation rule 


(def-fgl-rewrite fgl-bitwise-and
  (equal (bitwise-and x y)
         (if (int-endp-check x-endp x)
             (if (intcar x) (ifix y) 0)
           (if (int-endp-check y-endp y)
               (if (intcar y) (ifix x) 0)
             (intcons (and (intcar x)
                           (intcar y))
                      (bitwise-and (intcdr x) (intcdr y)))))))


Boolean objects, the result is the Boolean if-then-else of the test formula and 
the two branch formulas. This if-then-else formula is built in the AIG and a 
reference to the resulting node is returned as the Boolean formula resulting from 
the if. If the two branches are both integer values represented either as symbolic 
integers or integer constants, then the result is a new symbolic integer, the bits 
of which are the if-then-elses of the test with the corresponding bits from the 
two branches. 

As a simple example, the rule used to expand calls of bitwise-and is shown 
in Listing 1.9. This rewrites a call of bitwise-and on a pair of symbolic integers, 
producing a new symbolic integer in which each bit’s formula is the AND of the 
corresponding bits of the inputs. 

The rule applies to any call of bitwise-and. It first checks each of the inputs 
with int-endp-check. This is true if it can be syntactically determined that the 
input must be either -1 or 0; in particular, if the input’s symbolic integer
representation has only one bit. (The syntactic check works by binding its result 
to the free variable x-endp introduced within the form. The technical details of 
this rewriter feature are described elsewhere [39].) If this is true of either input, 
then the result is based on the one relevant bit of that input (the intcar): if it is 
true, then the input’s value is -1 and the result is the other input (coerced to an 
integer value using ifix, which replaces non-integer values with 0); if false, then 
the input’s value is 0 and therefore the result is too. In many cases, the intcar 
value will be a (non-constant) Boolean formula; the result of this if is then a 
new vector of Boolean formulas, each of which is the conjunction of the intcar 
formula with the corresponding bit of the other input. 

If the int-endp-check test is false on both inputs, then the rule creates the 
first bit of the result by creating the and of the first bits of the two inputs. (In 
ACL2, (and x y) is really shorthand for (if x y nil), so this is actually another 
if merge operation.) It then makes another call of bitwise-and on the remaining 
bits of the two inputs, which will cause another application of this rule; this 
recurs until the bits of one of the inputs are exhausted. 
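The recursion performed by this rule can be mimicked on concrete bit lists. The sketch below is our own illustration under simplifying assumptions: integers are LSB-first two's-complement lists of Python Booleans whose last element is the sign bit, so a one-element list denotes -1 or 0 (the case detected by int-endp-check). With concrete bits the if merge degenerates to an ordinary conditional; in FGL the bits would be AIG references.

```python
def to_bits(n: int, width: int):
    """LSB-first two's-complement bits of n; assumes n fits in `width` bits."""
    return [bool((n >> k) & 1) for k in range(width)]


def from_bits(bits):
    """Inverse of to_bits: the last bit carries negative weight."""
    value = sum(1 << k for k, b in enumerate(bits[:-1]) if b)
    return value - (1 << (len(bits) - 1)) if bits[-1] else value


def bitwise_and(xs, ys):
    """Bitwise AND following the three cases of the rewrite rule."""
    if len(xs) == 1:                      # x is -1 or 0
        return list(ys) if xs[0] else [False]
    if len(ys) == 1:                      # y is -1 or 0
        return list(xs) if ys[0] else [False]
    # Otherwise: AND the low bits and recur on the remaining bits.
    return [xs[0] and ys[0]] + bitwise_and(xs[1:], ys[1:])
```

For example, `from_bits(bitwise_and(to_bits(-1, 4), to_bits(5, 6)))` evaluates to 5, and the recursion stops as soon as the shorter operand is down to its sign bit.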

The bitwise-and rule is a particularly simple example of how FGL can be 
programmed to compute complex Boolean formulas, but designing and proving 
these sorts of rules for other operations is a straightforward exercise in interactive 
theorem proving. FGL also includes a library of such rules which the user can 
safely extend with new rewrite rules as needed. 

For some applications, the performance of stepping through iterative rules 
such as these using the rewriter is insufficient. For these cases, FGL supports 
creating custom rewriting procedures analogous to ACL2’s metafunctions [23]
and invoking them via rules similar to ACL2’s meta rules. Metafunctions operate 
directly on the syntactic forms to be rewritten—symbolic objects in FGL, terms 
in ACL2. They return a resulting term (and substitution in FGL, though not 
in ACL2) that is equivalent to the input object. To allow a metafunction to 
be applied during rewriting, a meta rule is admitted, which requires proving a 
theorem stating that the metafunction produces correct results. It is noteworthy 
that FGL itself is proven in ACL2 to produce correct results even with user 
extension via rewrite rules or custom rewriting procedures. 


5 Conclusion 


Over the past years, formal verification at Centaur has moved beyond its previous 
focus on data-path proofs for arithmetic modules. Our verification projects have 
expanded into the areas of front-end decoding and microcode, as well as the 
implementations of a rich set of microoperations. We engage with the design 
process in its early stages and maintain and expand our proofs throughout the 
whole life cycle of the project. Over the years, our tools have been improved and 
we have learned a few lessons. 

We chose to use open-source tools and we are constantly contributing to 
ACL2 libraries. The ACL2 community has a tested way of collaboration between 
groups using git, peer reviewed commits, and a rich regression suite. 

We write specifications that can be expanded and refined in response to 
design and microarchitectural changes. When the design is incomplete, the spec- 
ifications are still useful when augmented by relevant assumptions. When a 
project requires additional flags or features, a modular style of specification 
allows for appropriate changes. We try to avoid complex specifications like those 
for the front-end decoder or ROM instruction decoder. These parts of the design 
are implicitly verified during microcode verification. 

Scheduled, triggered, and manual regressions are an important safeguard to 
avoid breaking consistency among our proofs. They catch undesirable changes 
in the specifications, tools, and design. 

A key to ensuring stability of the proofs is their scope—the bigger the scope, 
the more stable the proofs, because changes to interfaces of larger modules are 
less frequent than changes at lower levels. The transition from unit to cluster- 
level proofs led to substantially higher robustness and easier maintenance. This 
has been possible due to improvements in the process of building our formal 
models and enhancements in FGL. We also benefit greatly from enhancements 
in modern SAT solvers. 

We still have considerable work to do towards achieving our verification goals. 
Some of these goals could be achieved with more manpower, whereas for others
we do not have the right technology yet. There is a lot of microcode left to 
be verified. We have not verified the mechanisms of out-of-order microoperation 
scheduling, but we believe it is possible with our tools. We do not have a complete 
methodology for verification of memory access instructions yet. Our plan is to 
work on all these fronts. 
Abstract. This paper is a tutorial on algebraic program analysis. It
explains the foundations of algebraic program analysis, its strengths and 
limitations, and gives examples of algebraic program analyses for numer- 
ical invariant generation and termination analysis. 


1 Introduction 


This tutorial provides an introduction to algebraic program analysis, focusing 
upon techniques for (numerical) invariant generation and termination analysis. 
By reading this paper, you will learn the answers to the following questions: 


— How does one design an algebraic program analysis? 

— What new opportunities does algebraic program analysis enable? 

— What are the limitations and important open problems in algebraic program 
analysis? 


The origin of algebraic program analysis is the algebraic approach to solving 
path problems in graphs [1,6,48,59]: (1) compute a regular expression recogniz- 
ing a set of paths of interest, and (2) interpret that regular expression within an 
algebraic structure corresponding to the problem at hand. Various path problems 
(e.g., computing shortest paths, path-finding problems, and dataflow analysis) can 
be solved by using different algebraic structures to interpret regular expressions. 

In the context of program analysis, the graph of interest is a control flow 
graph for a program, and the algebra defines a space of summaries (approxima- 
tions of program behavior) and a means for composing them. The algebraic app- 
roach amounts to computing a summary for a program in “bottom-up” fashion, 
building summaries for larger and larger subprograms by applying the operators 
of the summary algebra. 

The general pattern of an algebraic program analysis is: given a system of 
(recursive) equations defining the semantics of a program, (1) symbolically com- 
pute a closed-form solution, and then (2) interpret the closed form within an 
algebraic structure corresponding to the analysis. The algebraic approach can 
be contrasted with classical iterative abstract interpretation, which also starts 
with a system of (recursive) equations defining the semantics of a program. How- 
ever, the iterative approach is to (a) interpret the operations in the equations in 
an abstract domain, and then (b) solve the equations over the abstract domain 
by successive approximation. Thus, the classical approach is one of “interpret
and then solve,” whereas the algebraic approach is “solve and then interpret.” 

The algebraic approach can be applied to various kinds of equations and alge- 
braic structures. Three cases we consider in this article, and the corresponding 
kind of program-analysis problems they can be used to solve, are: 


Section 2 (Non-recursive) program summarization: left-linear equations over reg- 
ular algebras. 

Section 4 Linearly-recursive procedure summarization: linear equations over 
tensor-product domains. 

Section 5 Conditional termination analysis: right-linear equations over ω-regular
algebras.


Why Algebraic Program Analysis? Algebraic program analysis is a general frame- 
work for understanding compositional program analyses. The principle of com- 
positionality states that “the meaning of a complex expression is determined by 
its structure and the meanings of its constituents” [57]. A program analysis is 
compositional when the result of analyzing a composite program is a function 
of the results of analyzing its components. Compositionality enables program 
analyses to scale to large programs, to be parallelized, to be applied incremen- 
tally, and to be applied to incomplete programs [18]. Algebraic program analysis 
provides a structure in which to think about how to design such an analysis. 

Insistence upon compositionality also demands a different perspective on 
program analysis, which can suggest solutions to problems that may otherwise 
not be apparent. We demonstrate this principle with a series of examples that 
illustrate a variety of different ideas that are enabled by thinking of program 
analysis in compositional terms. 

Last, the algebraic framework enables a style of reasoning about the behavior 
of program analyses themselves. By exploiting compositionality, it is possible to 
design effective algebraic analyses that satisfy certain laws (e.g., monotonicity— 
“more information in yields more information out”). Analyses can be classified 
on the basis of algebraic laws that they satisfy, and we can reason how program 
transformation affects analysis using these laws. 


Why Not Algebraic Program Analysis? While compositionality brings many 
desirable properties, it comes at the price of losing context. Compositionality 
requires that the analysis of a program component is a function of the source 
code of that component, and therefore cannot depend on the surrounding con- 
text in which the component appears in the program. Many program analysis 
techniques make essential use of context, for example: 


— In an iterative abstract interpreter, which propagates information about 
reachable states from the program entry forwards, the analysis of a com- 
ponent depends on every component that may precede it in an execution. 

— In a refinement-based software model checker, which inspects paths that go 
from entry to an error state, the analysis of a component depends on the 
whole program. 
One of the main challenges of designing a good algebraic program analysis is to
overcome this loss of contextual information. 

Secondly, algebraic program analysis is less general than iterative program 
analysis, in the sense that any set of semantic (in)equations can be solved itera- 
tively using the same basic algorithm, whereas each particular type of equation 
system requires a specialized algorithm. Some problems—e.g., resolving semantic 
equations of recursive procedures—have no known practical algebraic solutions. 


2 Regular Algebraic Program Analysis 


This section describes the algebraic approach to solving path problems in graphs 
[1,6,48,59]. The basic structure of the method is to use regular expressions to 
capture the set of paths of a graph, and then interpret these expressions to 
obtain a desired result. We illustrate the approach by considering the problem 
of computing shortest paths, and then show how it can be applied to numerical 
invariant generation. 

First, we establish some basic definitions. The syntax of regular expressions
over an alphabet $\Sigma$ is as follows:

\[ a \in \Sigma \qquad R \in \mathsf{RegExp}(\Sigma) ::= a \mid 0 \mid 1 \mid R_1 + R_2 \mid R_1 \cdot R_2 \mid R^* \]

We will sometimes use juxtaposition $R_1 R_2$ (rather than $R_1 \cdot R_2$) to denote
concatenation.

The semantics of regular expressions over $\Sigma$ is given by a $\Sigma$-interpretation
$\mathcal{I} = (\mathbf{A}, f)$, which consists of a regular algebra $\mathbf{A}$ and a semantic function $f$. A
regular algebra $\mathbf{A} = \langle A, 0^A, 1^A, +^A, \cdot^A, (-)^{*^A} \rangle$ is an algebraic structure consisting
of a set $A$ (called its universe) equipped with two distinguished elements
$0^A, 1^A \in A$, two binary operations $+^A$ (choice) and $\cdot^A$ (sequencing), and a
unary operation $(-)^{*^A}$ (iteration).¹ When the algebra is clear from context, we
will drop the superscript. A semantic function $f : \Sigma \to A$ maps each letter in
$\Sigma$ to an element of $\mathbf{A}$’s universe.

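A $\Sigma$-interpretation $\mathcal{I} = (\mathbf{A}, f)$ assigns to each regular expression $R$ over $\Sigma$
an element $\mathcal{I}[\![R]\!]$ of $\mathbf{A}$ by interpreting each letter according to the semantic
function and each regular operator using its counterpart in $\mathbf{A}$:

\[ \begin{array}{ll}
\mathcal{I}[\![0]\!] = 0^A & \mathcal{I}[\![R_1 \cdot R_2]\!] = \mathcal{I}[\![R_1]\!] \cdot^A \mathcal{I}[\![R_2]\!] \\
\mathcal{I}[\![1]\!] = 1^A & \mathcal{I}[\![R_1 + R_2]\!] = \mathcal{I}[\![R_1]\!] +^A \mathcal{I}[\![R_2]\!] \\
\mathcal{I}[\![a]\!] = f(a) \text{ for } a \in \Sigma & \mathcal{I}[\![R^*]\!] = (\mathcal{I}[\![R]\!])^{*^A}
\end{array} \]

Notice that the interpretation is compositional: for any expression $R$, $\mathcal{I}[\![R]\!]$
is a function of the top-level operator in $R$ and the interpretations of its
sub-expressions.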
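The definition above translates almost directly into a recursive evaluator. The sketch below is our own illustration: the class names, the `Algebra` record, and the example algebra `NULLABLE` (which computes whether an expression recognizes the empty word) are hypothetical choices made for the example and do not appear in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zero:
    pass


@dataclass(frozen=True)
class One:
    pass


@dataclass(frozen=True)
class Letter:
    name: str


@dataclass(frozen=True)
class Plus:
    left: object
    right: object


@dataclass(frozen=True)
class Dot:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    body: object


@dataclass(frozen=True)
class Algebra:
    zero: object
    one: object
    plus: object   # binary function (choice)
    dot: object    # binary function (sequencing)
    star: object   # unary function (iteration)


def interpret(r, alg, f):
    """Interpret r bottom-up: letters via f, operators via their counterparts in alg."""
    if isinstance(r, Zero):
        return alg.zero
    if isinstance(r, One):
        return alg.one
    if isinstance(r, Letter):
        return f(r.name)
    if isinstance(r, Plus):
        return alg.plus(interpret(r.left, alg, f), interpret(r.right, alg, f))
    if isinstance(r, Dot):
        return alg.dot(interpret(r.left, alg, f), interpret(r.right, alg, f))
    if isinstance(r, Star):
        return alg.star(interpret(r.body, alg, f))
    raise TypeError(f"not a regular expression: {r!r}")


# Example algebra over Booleans: does the expression recognize the empty word?
NULLABLE = Algebra(zero=False, one=True,
                   plus=lambda a, b: a or b,
                   dot=lambda a, b: a and b,
                   star=lambda a: True)
```

With the semantic function `lambda a: False` (no single letter is the empty word), `interpret(Star(Letter("a")), NULLABLE, lambda a: False)` is `True`. Swapping in a different `Algebra` value changes the analysis without touching `interpret`.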


1 Note that no particular laws are assumed to govern these operations. We will return 
to this issue in Sect. 3. 
Example 1 (Standard interpretation). The standard interpretation of regular
expressions is the language interpretation, $\mathcal{L} = (\mathbf{L}, \ell)$, where $\mathbf{L}$ is the regular
algebra of languages. The universe of the interpretation is the set of regular
languages over $\Sigma$, $0 \triangleq \emptyset$ is the empty language, $1 \triangleq \{\epsilon\}$ is the singleton language
containing the empty word, and the operators are

\[ \begin{array}{ll}
X + Y \triangleq X \cup Y & \text{Union} \\
X \cdot Y \triangleq \{xy : x \in X, y \in Y\} & \text{Concatenation} \\
X^* \triangleq \{x_1 x_2 \cdots x_n : x_1, \ldots, x_n \in X\} & \text{Kleene closure}
\end{array} \]

The semantic function $\ell$ maps each letter $a$ to the singleton language $\{a\}$. For
any regular expression $R$, $\mathcal{L}[\![R]\!]$ is the (regular) set of words recognized by $R$. ⌟


We now describe how non-standard interpretations can be used to solve problems
over directed graphs. A directed graph $G = (V, E)$ consists of a finite set
of vertices $V$ and a finite set of directed edges $E \subseteq V \times V$. A path in $G$ is a
finite sequence $e_1 e_2 \cdots e_n$ with $e_i \in E$ such that for each $i$, the destination of $e_i$
matches the source of $e_{i+1}$. A path expression (in $G$) is a regular expression
over the alphabet of edges $E$ that recognizes a set of paths in $G$. For any pair
of vertices $u, v \in V$, there is a path expression $\mathsf{PathExp}_G(u, v)$ that recognizes
exactly the set of paths in $G$ that begin at $u$ and end at $v$. There are several ways
to compute path expressions. The classical method is Kleene’s algorithm [44] for
computing a regular expression for a finite state automaton (thinking of $G$ as an
automaton over the alphabet $E$ with start state $u$ and final state $v$). For sparse
graphs, there are more efficient alternatives to Kleene’s algorithm, in particular 
Tarjan’s algorithm [58]. The insight of the algebraic approach to path problems 
is that these algorithms can be re-used for multiple purposes: first use a path 
expression algorithm to find a regular expression recognizing a set of paths of 
interest, and then compute a problem-dependent (non-standard) interpretation 
of that expression. 


Example 2 (Shortest paths). Consider the integer-weighted graph depicted in
Fig. 1a. Suppose that we wish to compute the length of the shortest path from
a to c. We begin by computing a path expression recognizing all paths from a
to c:

\[ \big( (a,b)\,(b,d)\,((d,e)\,(e,d))^*\,(d,a) \big)^* \; (a,b)\, \big( (b,c) + (b,d)\,((d,e)\,(e,d))^*\,(d,c) \big) \]

This path expression can be represented succinctly by the directed acyclic
graph (DAG) pictured in Fig. 1b. Define the distance interpretation $\mathcal{D}$ where
the semantic function maps each edge to its weight, and the algebra’s universe
consists of the integers along with $\pm\infty$, 0 is interpreted as $\infty$, 1 as 0, and the
operators are as follows:
[Figure 1: (a) an integer-weighted graph; (b) path expression DAG]

Fig. 1. An integer weighted graph and a path expression DAG representing the paths
from a to c


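\[ \begin{array}{ll}
d_1 + d_2 \triangleq \min(d_1, d_2) & \text{Minimum} \\
d_1 \cdot d_2 \triangleq d_1 + d_2 & \text{Addition} \\
d^* \triangleq \begin{cases} -\infty & \text{if } d < 0 \\ 0 & \text{otherwise} \end{cases} & \text{Closure}
\end{array} \]

The weight of the shortest weighted path from a to c is $\mathcal{D}[\![\mathsf{PathExp}_G(a, c)]\!] = 1$,
which can be calculated efficiently by interpreting the path expression DAG
“bottom-up” (see gray labels in Fig. 1b). ⌟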
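The distance interpretation can be sketched in a few lines of Python. The sketch makes two assumptions of its own: path expressions are encoded as nested tuples, and the edge weights in `WEIGHT` are hypothetical values chosen for illustration (they are not the weights of Fig. 1a, so the result differs from the value stated in the example). The helper names are ours.

```python
import math

INF = math.inf


def d_plus(d1, d2):
    """Choice: keep the shorter of the two alternatives."""
    return min(d1, d2)


def d_dot(d1, d2):
    """Sequencing: add lengths; "no path" (+inf) annihilates, even against -inf."""
    if d1 == INF or d2 == INF:
        return INF
    return d1 + d2


def d_star(d):
    """Iteration: a negative cycle can be repeated forever; otherwise take it 0 times."""
    return -INF if d < 0 else 0


def distance(r, weight):
    """Interpret a path expression in the distance algebra."""
    tag = r[0]
    if tag == "edge":
        return weight[r[1]]
    if tag == "+":
        return d_plus(distance(r[1], weight), distance(r[2], weight))
    if tag == ".":
        return d_dot(distance(r[1], weight), distance(r[2], weight))
    if tag == "*":
        return d_star(distance(r[1], weight))
    raise ValueError(tag)


def edge(u, v):
    return ("edge", (u, v))


def seq(*rs):
    """Left-nested concatenation of one or more expressions."""
    out = rs[0]
    for r in rs[1:]:
        out = (".", out, r)
    return out


LOOP_D = ("*", seq(edge("d", "e"), edge("e", "d")))
PATHS_A_TO_C = seq(
    ("*", seq(edge("a", "b"), edge("b", "d"), LOOP_D, edge("d", "a"))),
    edge("a", "b"),
    ("+", edge("b", "c"), seq(edge("b", "d"), LOOP_D, edge("d", "c"))),
)

# Hypothetical weights, for illustration only.
WEIGHT = {("a", "b"): 1, ("b", "c"): 5, ("b", "d"): 2, ("d", "e"): 1,
          ("e", "d"): -1, ("d", "a"): 3, ("d", "c"): 1}
```

With these weights, `distance(PATHS_A_TO_C, WEIGHT)` is 4 (the path a, b, d, c). Lowering the weight of (e, d) to -2 creates a negative cycle, and the result becomes `-math.inf`.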


Algebraic path-finding can be used to generate invariants by representing a 
program by a control flow graph, and interpreting path expressions within an 
algebra of program summaries. A control flow graph (CFG) $G = (V, E, r, C)$
is a directed graph $(V, E)$ with a distinguished root (or entry) vertex $r \in V$,
and where each edge $e \in E$ is labeled by a command $C(e)$; see Fig. 2a for an
example. In the remainder of this section, we give examples of interpretations 
that can be used to generate (numerical) program summaries. 


2.1 Transition-Formula Interpretations 


Fix a finite set of variables, X, representing the variables of a program. A tran- 
sition formula is a logical formula F(X, X’) whose free variables range over X 
and a set of “primed copies” $X' \triangleq \{x' : x \in X\}$. For the purposes of this exposition,
we further suppose that variables range over integers, and that transition
formulas are expressed in the language of linear integer arithmetic. A transition
formula can be interpreted as a binary relation $\to_F$ over states $\mathsf{State} \triangleq \mathbb{Z}^X$,
where $s \to_F s'$ if and only if $F$ is true when $s$ is used to interpret the un-primed
variables and $s'$ is used to interpret the primed variables. For example, if $F$ is
the transition formula

\[ F \triangleq x' = x + 1 \land y' = y \land x < y, \]

then we have

\[ s \to_F s' \iff s'(x) = s(x) + 1,\ s'(y) = s(y), \text{ and } s(x) < s(y). \]


Suppose that G = (V, E, r, C) is a control flow graph, where commands range 
over assignments x := e and assumptions [c], where e is a linear integer term 
and c is a linear arithmetic formula. (An assumption [c] is a command that does 
not change the program state, but which can only be executed if the formula c 
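holds.) We define a semantic function $\mathit{tf}$ that maps each control flow edge into
the universe of transition formulas by translating the command associated with
the edge into logic:

\[ \mathit{tf}(e) \triangleq \begin{cases}
x' = e \land \bigwedge_{y \neq x \in X} y' = y & \text{if } C(e) \text{ is } x := e \\
c \land \bigwedge_{y \in X} y' = y & \text{if } C(e) \text{ is } [c]
\end{cases} \]

We define an algebra of transition formulas as follows:

\[ \begin{array}{ll}
0 \triangleq \mathit{false} & \text{Empty relation} \\
1 \triangleq \bigwedge_{x \in X} x' = x & \text{Identity relation} \\
F + G \triangleq F \lor G & \text{Union} \\
F \cdot G \triangleq \exists X''.\ F(X, X'') \land G(X'', X') & \text{Relational composition}
\end{array} \]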


Above and elsewhere, we use positional notation for substitution; e.g., F(X, X”) 
denotes the formula obtained by replacing all the X’ symbols with “double 
primed” symbols in X” (and leaving the un-primed X symbols as they are). 
Intuitively, F* should be interpreted as the reflexive transitive closure of F.
However, in general it is not possible to compute the reflexive transitive closure 
of a formula (nor even to represent it as a formula). Hence, we must be content 
with an over-approximate transitive closure operator. There are many different 
methods for over-approximating transitive closure, so we speak of the family of 
algebras of transition formulas, which have the same basic structure and dif- 
fer only in the interpretation of the iteration operator. In the remainder of this 
section, we describe a selection of methods for implementing the iteration opera- 
tor. Disclaimer: for each example, the presentation differs somewhat (sometimes 
substantially) from the cited source. The examples should be read as “how the 
cited analysis might be presented in the algebraic framework.” 
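To make the structure of this algebra concrete, the following sketch replaces formulas by explicit binary relations over a small finite state space. This is an assumption made purely for illustration: over a finite space the reflexive transitive closure is computable exactly, which is precisely what fails for formulas over the integers and why an over-approximate iteration operator is needed there. The function names are ours.

```python
def identity(states):
    """1: the identity relation."""
    return {(s, s) for s in states}


def union(f, g):
    """F + G: choice between two relations."""
    return f | g


def compose(f, g):
    """F . G: relational composition (the middle state is quantified away)."""
    return {(s, u) for (s, t) in f for (t2, u) in g if t == t2}


def star(f, states):
    """Exact reflexive transitive closure, computed as a least fixpoint."""
    closure = identity(states)
    while True:
        bigger = closure | compose(closure, f)
        if bigger == closure:
            return closure
        closure = bigger


# States are values of a single variable x in {0, ..., 4};
# F encodes the transition formula  x' = x + 1 and x < 3.
STATES = range(5)
F = {(x, x + 1) for x in STATES if x < 3}
```

Here `star(F, STATES)` relates x to x' exactly when x' = x or x <= x' <= 3, and the empty set plays the role of 0.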
Example 3 (Transitive Predicate Abstraction [47]). Fix a set of variables X. Say
that a transition formula p(X, X’) is


— reflexive if $\bigwedge_{x \in X} x' = x \models p(X, X')$
— transitive if $p(X, X') \land p(X', X'') \models p(X, X'')$


Let P be a finite set of candidate reflexive and transitive transition formulas. 
For example we might choose 


\[ \{x \bowtie x' : x \in X,\ \bowtie \in \{\le, \ge\}\} \]


A 
> ined a elie e Koei e.2 Si} 


We can define an iteration operator that over-approximates the reflexive transi- 
tive closure of a formula F by the conjunction of the subset of P that is entailed 
by F: 


\[ F^* \triangleq \bigwedge \{p \in P : F \models p\}. \] ⌟


Example 4 (Interval analysis [51]). Let $F(X, X')$ be a transition formula. An
inductive interval invariant for $F$ assigns to each variable $x \in X$ a pair of integers
$a_x, b_x \in \mathbb{Z}$ such that if $s$ is a state such that $s(x) \in [a_x, b_x]$ for all $x \in X$ and
$s \to_F s'$, then $s'(x) \in [a_x, b_x]$ for all $x \in X$. Monniaux showed that it is possible to
determine optimal inductive interval invariants by posing the inductive-invariance
condition symbolically and quantifying over the bounds [51].

Let $P = \{p_x : x \in X\}$ and $Q = \{q_x : x \in X\}$ be sets of fresh variables, which
we use to represent the lower and upper bounds of intervals, respectively. The set of
inductive interval invariants for a formula $F$ can be represented by the formula

\[ \mathit{Inv}(F, P, Q) \triangleq \forall X, X'.\ \Big( F \land \bigwedge_{x \in X} p_x \le x \le q_x \Big) \Rightarrow \Big( \bigwedge_{x \in X} p_x \le x' \le q_x \Big) \]

That is, the models of $\mathit{Inv}$ (which assign integers to the lower and upper bound
variables $P$ and $Q$) are in one-to-one correspondence with the interval invariants
of $F$. We may universally quantify over all inductive interval invariants to arrive
at the following iteration operator:

\[ F^* \triangleq \forall P, Q.\ \Big( \mathit{Inv}(F, P, Q) \land \bigwedge_{x \in X} p_x \le x \le q_x \Big) \Rightarrow \Big( \bigwedge_{x \in X} p_x \le x' \le q_x \Big) \]


In contrast to the typical iterative approach with classical widening and nar- 
rowing operators, this operator computes a formula that implies all (and 
therefore most precise) inductive interval invariants.² For example, for the


2 Note that while the formula implies all interval invariants, it does not itself take the
form of an interval invariant. 




loop (while ($i \neq n$) do i := i + 1), this method yields the following
over-approximation of the reflexive transitive closure of F:

\[ F^* \equiv n' = n \land i \le i' \land (i \le n \Rightarrow i' \le n) \]

If we suppose that i is initially 0 and n is initially 100, then this formula implies
the loop invariant that n is equal to 100, and i is in the interval [0, 100]. ⌟
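As a sanity check on this summary, one can run the loop concretely and confirm that every pair of an initial state and a state reached after some number of iterations satisfies the formula. The sketch below is a test harness of our own, not part of the analysis; it only samples executions and therefore demonstrates soundness on those runs rather than proving it.

```python
def loop_step(i, n):
    """One iteration of `while (i != n) do i := i + 1`, or None if the guard fails."""
    return (i + 1, n) if i != n else None


def star_summary(i, n, i2, n2):
    """The summary  n' = n  and  i <= i'  and  (i <= n implies i' <= n)."""
    return n2 == n and i <= i2 and ((not i <= n) or i2 <= n)


def reachable_pairs(i, n, max_steps):
    """All (pre, post) pairs obtained by running 0..max_steps iterations from (i, n)."""
    pairs = [((i, n), (i, n))]
    cur = (i, n)
    for _ in range(max_steps):
        nxt = loop_step(*cur)
        if nxt is None:
            break
        cur = nxt
        pairs.append(((i, n), cur))
    return pairs
```

Starting from i = 0 and n = 100, the loop produces 101 pairs, all of which satisfy `star_summary`, and the final state is (100, 100).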


Example 5 (Recurrence analysis [4,27]). Let $F(X, X')$ be a transition formula,
and let $\mathbf{x}$ and $\mathbf{x}'$ denote vectors containing the variables $X$ and $X'$, respectively.
A linear recurrence inequation of $F$ is a formula of the form $a^T \mathbf{x}' \le a^T \mathbf{x} + b$ that
is entailed by $F$. The idea behind recurrence analysis is to extract a set of linear
recurrence inequations for a formula, $\{a_i^T \mathbf{x}' \le a_i^T \mathbf{x} + b_i\}_{i \in I}$, and to use the closed
form of those recurrences to over-approximate the transitive closure of $F$:

\[ F^* \triangleq \exists k.\ k \ge 0 \land \bigwedge_{i \in I} a_i^T \mathbf{x}' \le a_i^T \mathbf{x} + k b_i \]
For instance, consider the following loop: 


while (x > 0) do
  if (y < 0) { x := x + y; y := y - 1 }
  else { x := x - 2; y := y - 3 }


The loop exhibits the following recurrences 


< (2x—y)-1 Sai ey Fa =i 
<y-1 or in matrix form, |O 1 H < jO 1 A + ]-1 
<-y+3 o —1| LY o—1| 4 3 


which yields the following transition formula that summarizes the loop: 


k.k > OA (22 — y’) < (Qe—-y')—kAy <y—kA-y' <—-y4+3k. 


The loop also exhibits other recurrences (such as x' ≤ x − 1); however, the three
selected recurrences are complete in the sense that all implied recurrences are
non-negative linear combinations of these three (e.g., x' ≤ x − 1 is obtained by
adding 1/2 times the first and second recurrences).
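As a sanity check of the loop summary, the following sketch runs the loop on integer inputs and tests the closed form after every iteration, using the iteration count as the witness for k. It assumes integer-valued variables; the function names are hypothetical and the check is an experiment, not a proof.

```python
def step(x, y):
    """One loop iteration; returns None when the guard x > 0 fails."""
    if x <= 0:
        return None
    if y < 0:
        return (x + y, y - 1)
    return (x - 2, y - 3)


def summary(x, y, x2, y2, k):
    """The closed form with iteration count k as the witness for 'exists k'."""
    return (
        k >= 0
        and 2 * x2 - y2 <= 2 * x - y - k
        and y2 <= y - k
        and -y2 <= -y + 3 * k
    )


def summary_holds_along_run(x, y, max_iters=100):
    """Run the loop from (x, y) and check the summary after every iteration.
    Also checks the implied recurrence x' <= x - 1 in closed form."""
    state, k = (x, y), 0
    while state is not None and k <= max_iters:
        x2, y2 = state
        if not summary(x, y, x2, y2, k) or x2 > x - k:
            return False
        state, k = step(x2, y2), k + 1
    return True
```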

Such a complete set of recurrences exists for any transition formula F, and it
can be computed as follows. First, observe that the set of linear recurrences of F,

\[
\mathrm{Rec}(F) \triangleq \{(a,b) : F \models a^T x' \le a^T x + b\}
\]

is closed under non-negative linear combinations (i.e., it is a convex cone). Our
goal is to find a (finite) set of generators for Rec(F): a finite set $\{(a_i, b_i)\}_{i \in B}$
such that

\[
\mathrm{Rec}(F) = \Big\{ (0, \lambda_0) + \sum_{i \in B} \lambda_i (a_i, b_i) : \lambda_0 \ge 0,\ \lambda_i \ge 0 \text{ for all } i \in B \Big\}.
\]

To compute generators for Rec(F), we first introduce a fresh set of “difference”
variables, $\{\delta_x\}_{x \in X}$, and form a formula

\[
\Delta(F) \triangleq \exists X, X'.\ F \land \bigwedge_{x \in X} \delta_x = x' - x.
\]

Observe that $(a,b) \in \mathrm{Rec}(F)$ if and only if $\Delta(F) \models a^T \delta \le b$. Thus, a set of
generators for Rec(F) corresponds exactly to a half-space representation for the
convex hull of Δ(F), which can be computed using the algorithm from [27].

The class of linear recurrence inequations considered in this example can be 
generalized in various ways to yield more powerful invariant generation proce- 
dures. In particular, 


— [27] computes linear recurrences with polynomial closed forms 

— [42] computes polynomial recurrences with polynomial and complex exponen- 
tial closed forms. 

— [41] computes polynomial recurrences with polynomial and rational exponential
closed forms. ⌟


2.2 Weak Interpretations 


Transition formulas are an appealing basis for algebraic program analysis, since 
all the operators (except the iteration operator) are precise—they simply encode 
the meaning of the program into logic. The significance of this is that transition 
formula algebras delay precision loss as long as possible, which helps to overcome 
loss of contextual information. However, there are algebraic analyses of interest 
that are defined on weak logical fragments that cannot precisely express union 
and/or relational composition. 


Example 6 (Affine relation analysis [38]). An affine relation is a relation that
corresponds to the set of models of a transition formula of the form Ax' = Bx + c.
Define the algebra of affine transition relations to be the regular algebra where
the universe is the set of affine transition relations, 0 is interpreted as the empty
relation, 1 is interpreted as the identity relation, + is interpreted as the affine
hull of R₁ ∪ R₂ (the smallest affine relation that contains both R₁ and R₂), · is
interpreted as relational composition, and * is interpreted as the operation that
sends any affine relation R to the limit of the sequence $\{R_i\}_{i=0}^{\infty}$ defined by

\[
R_0 = 1 \qquad R_{i+1} = R_i + (R_i \cdot R) \quad \text{for } i \ge 0
\]

Since we have $R_0 \subseteq R_1 \subseteq \cdots$, and whenever $R_{i+1}$ properly contains $R_i$ the dimension
of $R_{i+1}$ is strictly greater than that of $R_i$, this sequence must stabilize in finite
time, so the operation R* is computable. ⌟
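To make the iteration concrete, here is a small worked instance. It is a sketch that assumes the sequence starts from the identity relation and uses a hypothetical two-variable relation R; aff(·) denotes the affine hull. The sequence stabilizes after one step, when the dimension grows from 2 to 3.

```latex
\begin{align*}
R &= \{\, x' = x + 1 \wedge y' = y + 2 \,\} \\
R_0 &= 1 = \{\, x' = x \wedge y' = y \,\} \\
R_1 &= R_0 + (R_0 \cdot R) = \mathrm{aff}(1 \cup R) = \{\, y' - y = 2(x' - x) \,\} \\
R_1 \cdot R &= \{\, y' - y = 2(x' - x) \,\} \subseteq R_1 \\
R_2 &= R_1 + (R_1 \cdot R) = R_1, \qquad\text{so } R^* = \{\, y' - y = 2(x' - x) \,\}
\end{align*}
```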


3 Semantic Foundations 


This section presents a general view of algebraic program analysis, with the goal 
of elucidating its underlying principles so that they may be understood outside
the setting of graphs and regular expressions. This sets the stage for Sect. 4 and
Sect. 5, wherein we will develop program analysis schemes that follow the same
general “recipe” that we lay out in this section, but deviate from the instance 
of this recipe that we saw in Sect. 2. 

Following the theory of abstract interpretation [22], we begin with a concrete 
semantics that defines the meaning of a program. The concrete semantics is 
specified as the least (or greatest) solution to a system of recursive equations. 
The concrete semantics is not computable—the goal of a program analysis is 
to approximate it. The way that this is accomplished in an algebraic analysis 
is by symbolically computing a closed-form solution to the semantic equations 
(i.e., a non-recursive system of equations whose (unique) solution coincides with 
the concrete semantics), and then interpreting that closed-form solution in an 
algebraic structure that approximates the algebra of the concrete semantics. 


3.1 Semantic Equations 


Given a control flow graph G, we can syntactically derive a system of equations
E(G) (see Fig. 2). For each vertex v, we introduce a variable $X_v$ and an equation
$(X_v = R_v)$ that relates that variable to the variables for v’s predecessors. Notice
that this system of equations can be viewed as a (left-)regular grammar, with
each non-terminal symbol $X_v$ recognizing the set of paths from the root r to the
vertex v. This is an instance of the more general concept of a solution to a system
of equations over an algebraic structure. A solution to the system of equations
$E(G) = \{X_v = R_v\}_{v \in V}$ over a regular interpretation $\mathscr{I} = (\mathbf{A}, f)$ is a function σ
that maps each variable to an element of A such that each equation is satisfied:
for each equation $(X_v = R_v)$ in E(G), we have $\sigma(X_v) = \mathscr{I}_\sigma[\![R_v]\!]$, where $\mathscr{I}_\sigma$ is
the interpretation obtained by extending the semantic function to variables by
interpreting them according to σ.

The prototypical concrete semantics of interest in algebraic analysis is the
relational semantics. The relational semantics of a program associates to every
control flow vertex v a reachability relation $R_v$, which is the set of pairs (s, s')
such that if the program begins at r in state s, then it may reach v with state
s'. The relational semantics may be obtained as the least solution to the system
of semantic equations over the relational interpretation, which is defined as
follows. The regular algebra of state relations, R, has binary relations on states
as its universe, 0 is interpreted as the empty relation ∅, 1 is interpreted as the
identity relation {(s, s) : s ∈ State}, · is interpreted as relational composition,
+ as union, and * as reflexive, transitive closure. The relational interpretation
$\mathscr{R}$ is the interpretation over the regular algebra of state relations where the
semantic function maps each command to its associated transition relation; e.g.,
i := i + 2 is associated with the set of all pairs (s, s') such that s'(i) = s(i) + 2
and s'(x) = s(x) for all x ≠ i. The relational semantics of a CFG G is the least
solution to E(G) over the relational interpretation.
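The regular algebra of state relations can be sketched directly when the state space is finite. The block below is illustrative only: it assumes a hypothetical state space consisting of the value of a single variable i modulo 8, encodes regular expressions as nested tuples, and uses the made-up symbol names "i<6" and "i:=i+2".

```python
STATES = range(8)  # hypothetical state space: the value of i, modulo 8

ZERO = frozenset()                         # empty relation
ONE = frozenset((s, s) for s in STATES)    # identity relation


def dot(r1, r2):
    """Relational composition."""
    return frozenset((s, u) for (s, t) in r1 for (t2, u) in r2 if t == t2)


def star(r):
    """Reflexive transitive closure, by iterating until nothing new is added."""
    closure = ONE
    while True:
        bigger = closure | dot(closure, r)
        if bigger == closure:
            return closure
        closure = bigger


def interpret(regex, f):
    """Evaluate a regular expression (nested tuples) in the algebra of state
    relations; `f` maps each alphabet symbol to its transition relation."""
    if regex == 0:
        return ZERO
    if regex == 1:
        return ONE
    if isinstance(regex, str):
        return f[regex]
    op, *args = regex
    vals = [interpret(a, f) for a in args]
    if op == "+":
        return vals[0] | vals[1]
    if op == ".":
        return dot(vals[0], vals[1])
    if op == "*":
        return star(vals[0])
    raise ValueError(f"unknown operator {op!r}")


SEMANTICS = {
    "i<6": frozenset((s, s) for s in STATES if s < 6),       # guard [i < 6]
    "i:=i+2": frozenset((s, (s + 2) % 8) for s in STATES),   # command i := i + 2
}
LOOP = ("*", (".", "i<6", "i:=i+2"))
```

Interpreting `LOOP` relates the start state 0 to exactly the states 0, 2, 4, and 6, which is the reachability relation of the corresponding loop.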

Having formulated the concrete semantics as the solution to a system of 
equations, we must now solve the system symbolically. The classical algorithm 
is a variation of Gaussian elimination, given in Algorithm 1. This algorithm 
is essentially Kleene’s algorithm [44] for computing a regular expression for a

[Figure 2, panel (a): control flow graph with root r, vertices r, a, b, c, d, e, f, and
edges (r,a), (a,b), (b,c), (c,d), (d,b), (d,e), (e,b), (b,f). The edges are labeled with
commands and guards, including i := i + 2, j := j + 1, [i < 1000], and [j < 500].]

(b)
\[
\begin{array}{rcl}
X_r &=& 1\\
X_a &=& X_r \cdot (r,a)\\
X_b &=& X_a \cdot (a,b) + X_d \cdot (d,b) + X_e \cdot (e,b)\\
X_c &=& X_b \cdot (b,c)\\
X_d &=& X_c \cdot (c,d)\\
X_e &=& X_d \cdot (d,e)\\
X_f &=& X_b \cdot (b,f)
\end{array}
\]

(c)
\[
\begin{array}{rcl}
X_r &=& 1\\
X_a &=& (r,a)\\
X_b &=& (r,a)(a,b)((b,c)(c,d)((d,b) + (d,e)(e,b)))^*\\
X_c &=& (r,a)(a,b)((b,c)(c,d)((d,b) + (d,e)(e,b)))^*(b,c)\\
X_d &=& (r,a)(a,b)((b,c)(c,d)((d,b) + (d,e)(e,b)))^*(b,c)(c,d)\\
X_e &=& (r,a)(a,b)((b,c)(c,d)((d,b) + (d,e)(e,b)))^*(b,c)(c,d)(d,e)\\
X_f &=& (r,a)(a,b)((b,c)(c,d)((d,b) + (d,e)(e,b)))^*(b,f)
\end{array}
\]

Fig. 2. (a) A control flow graph; (b) the corresponding system of equations; and (c)
a closed-form solution.


finite state automaton, recast in the language of equations. The front-solving
step eliminates variables one-by-one, at each step i producing a system of
equations that is equivalent to the original, but in which the variable
$X_i$ does not appear in the right-hand side of any equation $X_j = R_j$ for j > i.
The back-solving step eliminates all variable occurrences from right-hand sides,
at each step replacing $X_i$ with its closed form $R_i$ in each equation $X_j = R_j$ for
j < i. An example illustrating the result of solving the system of equations in
Fig. 2b symbolically appears in Fig. 2c. The significant difference from the familiar
Gaussian elimination algorithm in linear algebra is the “loop-solving” step,
which solves a single recursive equation $X_i = R_i$ symbolically by re-arranging
$R_i$ into the form $X_i A + B$ and taking $BA^*$ to be the solution. The loop-solving
step is justified under the relational interpretation, and more generally for any
interpretation over a Kleene algebra.³


³ The laws of Kleene algebra are not minimal in this regard.




Input : Left-linear system of equations, E = {X_i = R_i} for i = 1, ..., n
Output : Closed-form solution to E

for i = 1 to n do                              /* Front-solving */
    Re-arrange R_i in the form X_i A + B;
    R_i ← B A*;                                /* “Loop-solving” */
    foreach j > i do R_j ← R_j[X_i ↦ R_i];
end
for i = n to 2 do                              /* Back-solving */
    foreach j < i do R_j ← R_j[X_i ↦ R_i];
end
return E;

Algorithm 1: Gaussian elimination for left-linear systems of equations
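Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: regular expressions are encoded as pattern strings for Python's `re` module (with `None` for 0 and the empty string for 1), each equation is stored as a pair of a coefficient map and a constant term, and the edges of Fig. 2 are given hypothetical one-letter names so that paths become words.

```python
import re


def plus(a, b):
    """Regex for the union; None encodes 0 (the empty language)."""
    if a is None or a == b:
        return b
    if b is None:
        return a
    return f"(?:{a}|{b})"


def dot(a, b):
    """Regex for concatenation; the empty string encodes 1."""
    return None if a is None or b is None else a + b


def star(a):
    return "" if not a else f"(?:{a})*"


def solve(system):
    """system[i] = (coeffs, const) encodes X_i = sum_j X_j . coeffs[j] + const.
    Returns a list of variable-free regexes, one per variable."""
    eqs = [(dict(coeffs), const) for coeffs, const in system]
    n = len(eqs)

    def substitute(j, i):
        """R_j <- R_j[X_i -> R_i]."""
        coeffs_j, const_j = eqs[j]
        a = coeffs_j.pop(i, None)
        if a is None:
            return
        coeffs_i, const_i = eqs[i]
        for k, c in coeffs_i.items():
            coeffs_j[k] = plus(coeffs_j.get(k), dot(c, a))
        eqs[j] = (coeffs_j, plus(const_j, dot(const_i, a)))

    for i in range(n):                       # front-solving
        coeffs, const = eqs[i]
        loop = star(coeffs.pop(i, None))     # R_i = X_i A + B  becomes  B A*
        eqs[i] = ({k: dot(c, loop) for k, c in coeffs.items()}, dot(const, loop))
        for j in range(i + 1, n):
            substitute(j, i)
    for i in range(n - 1, 0, -1):            # back-solving
        for j in range(i):
            substitute(j, i)
    return [const for _, const in eqs]


# Fig. 2 with hypothetical one-letter edge names:
# (r,a)=p (a,b)=q (b,c)=r (c,d)=s (d,b)=t (d,e)=u (e,b)=v (b,f)=w
R, A, B, C, D, E, F = range(7)
FIG2 = [
    ({}, ""),                            # X_r = 1
    ({R: "p"}, None),                    # X_a = X_r (r,a)
    ({A: "q", D: "t", E: "v"}, None),    # X_b = X_a (a,b) + X_d (d,b) + X_e (e,b)
    ({B: "r"}, None),                    # X_c = X_b (b,c)
    ({C: "s"}, None),                    # X_d = X_c (c,d)
    ({D: "u"}, None),                    # X_e = X_d (d,e)
    ({B: "w"}, None),                    # X_f = X_b (b,f)
]
PATHS = solve(FIG2)
```

The resulting expressions need not be syntactically identical to those in Fig. 2(c), but they should denote the same path languages; for instance, `re.fullmatch(PATHS[F], "pqrstw")` succeeds because that word spells a path from r to f.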


Definition 1. Let A = (A, +, ·, *, 0, 1) be a regular algebra. We say that A is
an idempotent semiring if it satisfies the following (for all a, b, c ∈ A):

\[
\begin{array}{lll}
a + (b + c) = (a + b) + c & a(bc) = (ab)c & \text{Associativity}\\
a(b + c) = ab + ac & (b + c)a = ba + ca & \text{Distributivity}\\
a + 0 = a & 1a = a1 = a & \text{Identity}\\
a + b = b + a & & \text{Commutativity of } +\\
a + a = a & & \text{Idempotence}\\
a0 = 0a = 0 & & \text{Annihilation}
\end{array}
\]


In any idempotent semiring, we may define a natural order ≤, where a ≤ b iff
a + b = b. Note that + is the least upper bound with respect to this order.

We say that A is a Kleene algebra if it is an idempotent semiring and the
following hold (for all a, x ∈ A):

\[
\begin{array}{lll}
1 + a(a^*) = a^* & 1 + (a^*)a = a^* & \text{Unfolding}\\
ax \le x \Rightarrow a^* x \le x & xa \le x \Rightarrow x a^* \le x & \text{Induction}
\end{array}
\]


Exercise 1. Show that in any Kleene algebra, the least solution to a (left-)linear
recursive equation X = a + Xb exists and is equal to ab*.
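The statement of Exercise 1 can be tested experimentally in one particular Kleene algebra, the algebra of binary relations over a finite set. The sketch below is not a proof and does not replace the exercise; the state space, the random generator, and the function names are assumptions made for illustration.

```python
import random

STATES = range(5)  # hypothetical finite state space


def dot(r1, r2):
    """Relational composition."""
    return frozenset((s, u) for (s, t) in r1 for (t2, u) in r2 if t == t2)


def star(r):
    """Reflexive transitive closure."""
    closure = frozenset((s, s) for s in STATES)
    while True:
        bigger = closure | dot(closure, r)
        if bigger == closure:
            return closure
        closure = bigger


def least_solution(a, b):
    """Least X with X = a + X.b, by Kleene iteration from the empty relation."""
    x = frozenset()
    while True:
        nxt = a | dot(x, b)
        if nxt == x:
            return x
        x = nxt


def random_relation(rng, density=0.3):
    """A random relation on STATES; `rng` is a random.Random instance."""
    return frozenset((s, t) for s in STATES for t in STATES if rng.random() < density)
```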


The sense in which Gaussian elimination computes a “closed-form solution”
to a system of left-linear equations E is that:


— (closed form) the right-hand sides do not refer to variables, and

— (solution) for any interpretation $\mathscr{I}$ over a Kleene algebra, for each equation
(X = R) ∈ E, we have $\sigma(X) = \mathscr{I}[\![R]\!]$, where σ is the least solution to E
over $\mathscr{I}$.


The connection between Gaussian elimination and graph algorithms like 
Floyd-Warshall inspired Tarjan’s path-expression algorithm [58]. In the language 
of graphs, Tarjan’s algorithm computes for each vertex v of a control flow graph
G with root r a path expression $\mathit{PathExp}_G(r,v)$ that recognizes the set of paths
from r to v; in the language of equations, it solves left-linear systems of equations
symbolically. Tarjan’s algorithm is preferred to Gaussian elimination in practice:
it is more efficient (nearly linear time for reducible flow graphs, compared to cubic
time for Gaussian elimination) and produces simpler solutions. For expository
purposes, we will continue to refer to Gaussian elimination for solving systems 
of equations, viewing Tarjan’s method as an efficient variation. 


3.2 Abstract Interpretation 


Gaussian elimination can solve a system of left-linear equations over a Kleene 
algebra (e.g., relational semantics) symbolically. However, the solution cannot be 
interpreted in the concrete algebra, since operators are not effective (that is, they 
cannot be implemented by a machine). We approximate the concrete semantics 
by interpreting the closed-form solution in an effective abstract algebra (e.g., one 
of the transition-formula algebras from Sect. 2). 

Following the theory of abstract interpretation [22], the correctness of this 
approach is justified by establishing a relationship between the “concrete” and 
“abstract” interpretations. In the algebraic framework, a natural way to express 
the relationship is via a soundness relation [24], which is a binary relation 
between two algebras that is preserved by the operations of the algebra. Mem- 
bership of a (concrete, abstract) pair in the relation indicates that the concrete 
element is approximated by the abstract element. 


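Definition 2 (Soundness relation). Given two Σ-interpretations $\mathscr{I}^\natural = (\mathbf{A}^\natural, f^\natural)$
and $\mathscr{I}^\sharp = (\mathbf{A}^\sharp, f^\sharp)$, a relation $\Vdash\ \subseteq A^\natural \times A^\sharp$ is a soundness relation
if $f^\natural(a) \Vdash f^\sharp(a)$ for all a ∈ Σ and ⊩ is a sub-algebra of the product algebra
$\mathbf{A}^\natural \times \mathbf{A}^\sharp$; i.e., $0^\natural \Vdash 0^\sharp$, $1^\natural \Vdash 1^\sharp$, and for all $x_1 \Vdash y_1$ and $x_2 \Vdash y_2$ we have


- $x_1 +^\natural x_2 \Vdash y_1 +^\sharp y_2$
- $x_1 \cdot^\natural x_2 \Vdash y_1 \cdot^\sharp y_2$
- $x_1^{*^\natural} \Vdash y_1^{*^\sharp}$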

The definition of soundness relation generalizes to interpretations over other 
classes of algebraic structures in the natural way: it is a binary relation over 
two algebras of the same signature that is preserved by every operation in the 
signature. 


Example 7 (Transition formula over-approximation). Let R denote the algebra
of state relations and TF denote an algebra of transition formulas. The
over-approximation relation is defined by

\[
R \Vdash_o F \iff \forall (s, s') \in R.\ s \rightarrow_F s'.
\]

Preservation of constants and the sequencing and choice operations is easily
verified; to show that $\Vdash_o$ is a soundness relation, we need only to show that $R \Vdash_o F$
implies $R^{*^{\mathbf{R}}} \Vdash_o F^{*^{\mathbf{TF}}}$; i.e., $(-)^{*^{\mathbf{TF}}}$ over-approximates reflexive transitive closure.
Of course, this proof depends on the particular implementation of the iteration
operator.

The over-approximate soundness relation allows us to verify safety properties:
if $R \Vdash_o F$ and F entails some property P, then R satisfies P. ⌟

Example 8 (Transition formula under-approximation). The under-approximation
relation is defined by

\[
R \Vdash_u F \iff \forall s, s'.\ s \rightarrow_F s' \Rightarrow (s, s') \in R.
\]

Preservation of constants and the sequencing and choice operations is again
easily verified; to show that $\Vdash_u$ is a soundness relation, we need only to show
that $R \Vdash_u F$ implies $R^{*^{\mathbf{R}}} \Vdash_u F^{*^{\mathbf{TF}}}$; i.e., $(-)^{*^{\mathbf{TF}}}$ under-approximates reflexive
transitive closure. The iteration operators in Sect. 2 are all over-approximate.
An example of an under-approximate iteration operator is

\[
F^{*} \triangleq \sum_{i=0}^{n} \underbrace{F \cdot \ldots \cdot F}_{i \text{ times}}
\]

(for some fixed choice of n), which corresponds to bounded model checking [9]
with an unrolling bound of n.

The under-approximate soundness relation allows us to refute safety properties:
if $R \Vdash_u F$ and F does not entail some property P, then R does not
satisfy P. ⌟
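The bounded-unrolling operator can be illustrated on finite relations. In the sketch below, relations stand in for transition formulas, the state space is a hypothetical counter with values 0 to 15, and the names `bounded_star`, `star`, and `violates` are made up for this illustration.

```python
STATES = range(16)  # hypothetical state space: the value of a counter i


def dot(r1, r2):
    """Relational composition."""
    return frozenset((s, u) for (s, t) in r1 for (t2, u) in r2 if t == t2)


def bounded_star(r, n):
    """Union of the i-fold compositions of r for i = 0..n."""
    power = frozenset((s, s) for s in STATES)
    result = power
    for _ in range(n):
        power = dot(power, r)
        result |= power
    return result


def star(r):
    """Exact reflexive transitive closure: unroll until a fixed point."""
    n = 0
    while bounded_star(r, n + 1) != bounded_star(r, n):
        n += 1
    return bounded_star(r, n)


# Loop body of `while (i < 10) do i := i + 1`.
BODY = frozenset((i, i + 1) for i in STATES if i < 10)


def violates(rel, start, bad):
    """True if `rel` relates `start` to the bad state."""
    return (start, bad) in rel
```

With an unrolling bound of 3, the under-approximation already witnesses that state 3 is reachable from state 0, so the property "i never equals 3" is refuted; it says nothing about state 4, which needs a fourth unrolling.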


The problem of “approximating the behavior of a program” can be formalized 
as follows: 


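Given a system of semantic equations over a set of variables $\mathcal{X}$ describing
the concrete semantics of a program (i.e., its least solution $\sigma^\natural$ over some
interpretation $\mathscr{I}^\natural$), find some $\sigma^\sharp : \mathcal{X} \to A^\sharp$ such that for each variable
$X \in \mathcal{X}$, we have $\sigma^\natural(X) \Vdash \sigma^\sharp(X)$.


The algebraic approach to this problem is to compute for each variable X a
closed form $R_X$ (such that $\sigma^\natural(X) = \mathscr{I}^\natural[\![R_X]\!]$), and define $\sigma^\sharp(X) \triangleq \mathscr{I}^\sharp[\![R_X]\!]$.
The correctness of this approach is justified by the following soundness lemma,
which follows by induction on regular expressions.


Lemma 1 (Soundness). Let Σ be an alphabet, let $\mathscr{I}^\natural = (\mathbf{A}^\natural, f^\natural)$ and $\mathscr{I}^\sharp = (\mathbf{A}^\sharp, f^\sharp)$
be Σ-interpretations, and let $\Vdash\ \subseteq A^\natural \times A^\sharp$ be a soundness relation. Then
for any regular expression R ∈ RegExp(Σ), we have $\mathscr{I}^\natural[\![R]\!] \Vdash \mathscr{I}^\sharp[\![R]\!]$.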


3.3 Discussion 


A subtlety of algebraic program analysis is that most algebras of interest in pro- 
gram analysis are not Kleene algebras (for instance, none of the algebras in Sect. 2 
are), and so in general, Gaussian elimination does not find solutions to systems of 
equations over “abstract” interpretations corresponding to program analyses. This 
technical difficulty is sidestepped by appealing to the concrete semantics (which 
typically is defined over a Kleene algebra, such as the algebra of state relations) 
to justify the use of path-expression algorithms, and a sound approximating algebra
to interpret the resulting expressions. The fact that the abstract interpretation
of the closed-form solution to the concrete system of equations does not yield
a solution to the abstract system of equations is immaterial: our goal is to
over-approximate the concrete rather than solve the abstract.

Formalizing a program analysis as an algebraic structure allows one to under- 
stand the behavior of program analyses in terms of algebraic laws, and use the 
language of algebra to reason about program analyses. For example, any transition
formula algebra (in the family described in Sect. 2.1) is an idempotent
semiring, and so any two *-free regular expressions that denote the same language
have the same (up to logical equivalence) interpretation as a transition
formula. While none of the iteration operators in Sect. 2.1 satisfy the Unfolding
and Induction laws of Kleene algebra, they do satisfy weaker pre-Kleene algebra 
iteration laws: 


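\[
\begin{array}{ll}
1 \le a^* & \text{Reflexivity}\\
a \le a^* & \text{Extensivity}\\
a^* a^* = a^* & \text{Transitivity}\\
a \le b \Rightarrow a^* \le b^* & \text{Monotonicity}\\
\text{For any } n,\ (a^n)^* \le a^* & \text{Unrolling}
\end{array}
\]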


A concrete use-case for these laws appears in [25], which develops regular expres- 
sion transformation techniques that preserve concrete semantics but are guar- 
anteed to produce (non-strictly) more precise abstract semantics. 

Such laws can also be useful for users of program analysis tools. For exam- 
ple, since all operations are monotone (as a consequence of the monotonicity and 
idempotent-semiring laws), a user can rely on the principle that “more informa- 
tion in yields more information out.” If a user alters a program P by adding addi- 
tional assume commands to get a program P’ (e.g., expressing invariants that 
are found by some other automated invariant generation technique, user-provided 
hints, etc.), monotonicity means that they may rely on the fact that the analysis 
will produce summaries for P’ that are at least as precise as those for P. 


A Recipe for Algebraic Program Analysis. We conclude this section by presenting 
a general view of algebraic program analysis, abstracted from the language of 
graphs and regular expressions: 


1. (Modeling) Express the concrete semantics as the least (or greatest) solution to 
a system of recursive equations (e.g., relational semantics as the least solution 
to the left-linear system of equations corresponding to a control flow graph). 

2. (Closed forms) Design a suitable language of “closed-form solutions” and an 
algorithm for computing them (e.g., regular expressions and path-expression 
algorithms). 

3. (Interpretation) Design an abstract interpretation of the language of closed 
forms and a soundness relation connecting the concrete and abstract interpre- 
tations (e.g., transition-formula algebras (Sect. 2.1) and the over-approximate 
soundness relation (Ex. 7)). 


Section 4 and Sect. 5 give two more instances of this generic recipe, generalizing 
beyond left-linear equations and regular-expressions as closed forms. Section 4 
considers linear equations (and an appropriate language of closed forms); Sect. 5
considers another form of equation with ω-regular expressions as closed forms.


4 Interprocedural Analysis


Algebraic program analyses are oriented around computing summaries for pro- 
gram fragments, and are naturally suited to analyzing programs with procedures. 
Following Cousot & Cousot [23] and Sharir & Pnueli [56], the idea is to structure 
the analysis in two phases: 


Phase I: compute for each procedure X a summary that approximates the 
behavior of X (including the actions of all procedures called transitively from 
X). 

Phase II: analyze whole-program paths from the start of the main procedure, 
using the summaries to interpret procedure calls. 


An example of a program with procedures is given in Fig. 3(a). The CFGs for 
its procedures are shown in Fig. 3(b) along with a set of equations corresponding 
to the CFGs (Fig. 3(c)). For Phase I, it is also useful to consider the following
equations in which we have eliminated all variables except for those of the form
$X_{s_i,x_i}$, which represent the procedure summaries.

\[
\begin{array}{rcl}
X_{s_1,x_1} &=& ((s_1,a) \cdot X_{s_2,x_2} + (s_1,b)) \cdot X_{s_2,x_2}\\
X_{s_2,x_2} &=& X_{s_3,x_3} \cdot X_{s_3,x_3}\\
X_{s_3,x_3} &=& (s_3,x_3)
\end{array}
\tag{1}
\]


This system of equations can be obtained either by a process of successively 
eliminating variables from Fig. 3(c), or they can be read off directly from each 
control-flow graph: sequential composition corresponds to -, and branching cor- 
responds to +. 

We can also construct a graph of the dependencies among the variables in 
the equation system. In this case, we would have 


\[
X_{s_3,x_3} \rightarrow X_{s_2,x_2} \rightarrow X_{s_1,x_1}
\tag{2}
\]


(which is also isomorphic to the program’s call graph). Note that the equations 
in Eq. (1) are not left-linear. However, by eliminating variables in a topological 
order of Eq. (2), these systems can still be solved using Gaussian elimination 
(Algorithm 1). 


\[
\begin{array}{rcl}
X_{s_3,x_3} &=& (s_3,x_3)\\
X_{s_2,x_2} &=& (s_3,x_3) \cdot (s_3,x_3)\\
X_{s_1,x_1} &=& ((s_1,a) \cdot (s_3,x_3) \cdot (s_3,x_3) + (s_1,b)) \cdot (s_3,x_3) \cdot (s_3,x_3)
\end{array}
\tag{3}
\]
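The callee-first substitution that turns Eq. (1) into Eq. (3) can be sketched as follows. The encoding is an assumption made for illustration: expressions are nested tuples, the names X1, X2, X3 stand for the three summary variables, and "s1a", "s1b", "s3x3" stand for the edges (s1,a), (s1,b), (s3,x3). The sketch uses Python's standard `graphlib` module to obtain a topological order of the dependency graph.

```python
from graphlib import CycleError, TopologicalSorter


def variables(expr):
    """Variables (strings starting with 'X') that occur in an expression."""
    if isinstance(expr, str):
        return {expr} if expr.startswith("X") else set()
    return set().union(*(variables(arg) for arg in expr[1:]))


def substitute(expr, closed):
    """Replace every solved variable in expr by its closed form."""
    if isinstance(expr, str):
        return closed.get(expr, expr)
    return (expr[0],) + tuple(substitute(arg, closed) for arg in expr[1:])


def solve_acyclic(equations):
    """Substitute summaries callee-first; raises CycleError on recursion."""
    deps = {x: variables(rhs) for x, rhs in equations.items()}
    closed = {}
    for x in TopologicalSorter(deps).static_order():
        closed[x] = substitute(equations[x], closed)
    return closed


def is_recursive(equations):
    """True if the dependency graph has a cycle, so substitution alone fails."""
    try:
        solve_acyclic(equations)
        return False
    except CycleError:
        return True


EQ1 = {
    "X1": (".", ("+", (".", "s1a", "X2"), "s1b"), "X2"),
    "X2": (".", "X3", "X3"),
    "X3": "s3x3",
}
```

For a system whose dependency graph has a cycle, `is_recursive` returns True, which mirrors the observation that this strategy breaks down for recursive procedures.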


Unfortunately, this strategy breaks down for programs with recursive pro- 
cedures: the essential difficulty is in computing the summaries of procedures 
that are directly recursive or part of a set of mutually recursive procedures. We 
will return to this issue shortly, after a brief discussion of Phase II, which can
be addressed via algebraic program analysis, regardless of whether the original
equation system contains recursion.

[Figure 3, panel (a): program listing with three procedures X1, X2, and X3; panel (b):
their control-flow graphs, with entry vertices s1, s2, s3 and exit vertices x1, x2, x3.]

(c)
\[
\begin{array}{rcl}
X_{s_1,s_1} &=& 1\\
X_{s_1,a} &=& X_{s_1,s_1} \cdot (s_1,a)\\
X_{s_1,b} &=& X_{s_1,a} \cdot X_{s_2,x_2} + X_{s_1,s_1} \cdot (s_1,b)\\
X_{s_1,x_1} &=& X_{s_1,b} \cdot X_{s_2,x_2}\\
X_{s_2,s_2} &=& 1\\
X_{s_2,c} &=& X_{s_2,s_2} \cdot X_{s_3,x_3}\\
X_{s_2,x_2} &=& X_{s_2,c} \cdot X_{s_3,x_3}\\
X_{s_3,s_3} &=& 1\\
X_{s_3,x_3} &=& X_{s_3,s_3} \cdot (s_3,x_3)
\end{array}
\]

Fig. 3. (a) A three-procedure program scheme. (b) Control-flow graphs for program
(a). The edges labeled “X2” and “X3” represent calls to the respective procedures. (c)
A system of equations corresponding to (b).

Fig. 4. Graph corresponding to the equation system used for Phase II for the program
from Fig. 3.


With closed-form solutions for the procedure summaries in hand, Phase II 
can be addressed with Gaussian elimination. (Note that for a program with 
recursive procedures, the transformed Phase II system is still recursive. However, 
it is left-recursive, and so can be handled with regular expressions, and analyzed 
using the transition-formula interpretations of Sect. 2—the “loops” in Phase II 
correspond to sequences of recursive calls.) Figure 4 shows the equation system

[Figure 5, panel (a): program listing in which the main procedure X1 executes C(s1,a) and
then calls X2, and X2 either executes C(s2,x2), or executes C(s2,b), runs a loop whose body
calls X2 twice, and then executes C(b,x2); panel (b): the control-flow graphs of X1 and X2.]

(c)
\[
\begin{array}{rcl}
X_{s_1,s_1} &=& 1\\
X_{s_1,a} &=& X_{s_1,s_1} \cdot (s_1,a)\\
X_{s_1,x_1} &=& X_{s_1,a} \cdot X_{s_2,x_2}\\
X_{s_2,s_2} &=& 1\\
X_{s_2,b} &=& X_{s_2,s_2} \cdot (s_2,b) + X_{s_2,d} \cdot (d,b)\\
X_{s_2,c} &=& X_{s_2,b} \cdot X_{s_2,x_2}\\
X_{s_2,d} &=& X_{s_2,c} \cdot X_{s_2,x_2}\\
X_{s_2,x_2} &=& X_{s_2,s_2} \cdot (s_2,x_2) + X_{s_2,b} \cdot (b,x_2)
\end{array}
\]

Fig. 5. (a) A two-procedure program scheme, where X1 represents the main procedure,
X2 represents a recursive subroutine, and C(s1,a), C(s2,x2), C(s2,b), and C(b,x2) represent
four program statements. (b) Control-flow graphs for program (a). The three edges
labeled “X2” represent calls to procedure X2. (c) A system of equations corresponding
to (b).


used for Phase II for the program from Fig. 3 in graphical form. The graph
is similar to Fig. 3(b) with (i) additional edges from each call-site to the start
node of the called procedure, and (ii) the edges previously labeled with “X2”
and “X3” are now labeled with the values from Eq. (3) for the corresponding
procedure summaries: (s3, x3) · (s3, x3) and (s3, x3), respectively.

The remainder of this section focuses on Phase I: computing procedure sum- 
maries. Consider the two-procedure program shown in Fig. 5(a). CFGs for its 
procedures are shown in Fig. 5(b) along with a set of recursive equations cor- 
responding to the interprocedural CFG. Unfortunately, equations like those in 
Fig. 5(c) do not fit naturally with the recipe given in Sect. 3.3. The essential 
difficulty is with item 2 of the recipe: “Design a suitable language of ‘closed-form solutions’
and an algorithm for computing them.” In particular, we cannot use regular 
expressions and path-expression algorithms because the equations in Fig. 5(c) 
are not left-linear (and they cannot be put in left-linear form). 

Two ideas are involved in using algebraic program analysis to summarize 
recursive procedures: 


1. The generalization by Esparza et al. [26] of Newton’s method—the classical 
numerical-analysis algorithm for finding roots of real-valued functions—to a 
method for solving a system of equations over a semiring S, called Newtonian 
Program Analysis (NPA). As in its real-valued counterpart, each iteration of 
NPA solves a simpler “linearized” problem. (See Sect. 4.1.) 

2. The technique of Reps et al. [53] for applying the algebraic-program-analysis 
recipe to the linearized problems that arise in NPA. (See Sect. 4.2.)


4.1 Motivation: Newtonian Program Analysis


To motivate why we are interested in the special case of linear equations
(Sect. 4.2), this section provides a brief overview of how linear equations arise
in NPA. Let $E = \{X_i = R_i\}_{i=1}^{n}$ be a system of equations, and fix an interpretation
$\mathscr{I}$ over some algebra A. Define a function $f : A^n \to A^n$ by
$f(a) = (\mathscr{I}_a[\![R_1]\!], \ldots, \mathscr{I}_a[\![R_n]\!])$ (i.e., the n-tuple of interpreted right-hand sides, where
variables are interpreted according to a). NPA is an iterative method for program
analysis that solves the following sequence of problems for ν:

\[
\nu^{(0)} = f(0) \qquad \nu^{(i+1)} = Y^{(i)} \tag{4}
\]

where $Y^{(i)}$ is the value of Y in the least solution of

\[
Y = f(\nu^{(i)}) + \mathit{LinearCorrectionTerm}(E, \nu^{(i)}, Y) \tag{5}
\]

Thus, NPA is similar to Kleene iteration, except that on each iteration, $f(\nu^{(i)})$
is “corrected” by an amount controlled by $\mathit{LinearCorrectionTerm}(E, \nu^{(i)}, Y)$, a
function of f, the current approximation $\nu^{(i)}$, and the (vector) variable Y, which
nudges the next approximation $\nu^{(i+1)}$ in the right direction at each step.

The linear correction term is the result of replacing each right-hand side
$R_i = \sum_j R_{i,j}$ with a sum $\sum_{j,k} R_{i,j,k}$, where each $R_{i,j,k}$ is obtained from $R_{i,j}$ by
replacing all variables, except possibly one, with their interpretations in ν. (The
formal definition can be found elsewhere [26, §3.2].) For example, consider the
system of equations below, a simplified variant of Fig. 5(c) that is obtained by
eliminating all variables except $X_{s_1,x_1}$, $X_{s_2,b}$, and $X_{s_2,x_2}$:

\[
\begin{array}{rcl}
X_{s_1,x_1} &=& (s_1,a) \cdot X_{s_2,x_2}\\
X_{s_2,b} &=& (s_2,b) + X_{s_2,b} \cdot X_{s_2,x_2} \cdot X_{s_2,x_2} \cdot (d,b)\\
X_{s_2,x_2} &=& (s_2,x_2) + X_{s_2,b} \cdot (b,x_2)
\end{array}
\tag{6}
\]

The transformation results in the following system (for brevity, we denote
$Y_{s_1,x_1}$, $Y_{s_2,b}$, $Y_{s_2,x_2}$ by $Y_1$, $Y_2$, $Y_3$):

\[
\begin{array}{rcl}
Y_1 &=& (s_1,a) \cdot Y_3\\
Y_2 &=& (s_2,b) + Y_2 \cdot \nu_3 \cdot \nu_3 \cdot (d,b) + \underline{\nu_2 \cdot Y_3 \cdot \nu_3 \cdot (d,b)} + \underline{\nu_2 \cdot \nu_3 \cdot Y_3 \cdot (d,b)}\\
Y_3 &=& (s_2,x_2) + Y_2 \cdot (b,x_2)
\end{array}
\tag{7}
\]


Note that the two underlined summands are both truly linear: they are linear,
but neither left-linear nor right-linear.

The process of solving Eqs. (4) and (5) for $\nu^{(i+1)}$, given $\nu^{(i)}$, is called
one Newton round. On the initial Newton round, we set
$(\nu_1^{(0)}, \nu_2^{(0)}, \nu_3^{(0)}) \leftarrow (0, \mathscr{I}[\![(s_2,b)]\!], \mathscr{I}[\![(s_2,x_2)]\!])$, which is f(0) for Eq. (6). On round i + 1, we solve Eq. (7) for $(Y_1, Y_2, Y_3)$
with $(\nu_1, \nu_2, \nu_3)$ set to the value $(\nu_1^{(i)}, \nu_2^{(i)}, \nu_3^{(i)})$ obtained on round i, and then
set $(\nu_1^{(i+1)}, \nu_2^{(i+1)}, \nu_3^{(i+1)}) \leftarrow (Y_1, Y_2, Y_3)$.

Operationally, the linearization transformation imposes a particular protocol
for sampling the program’s space of behaviors. For instance, in Fig. 5(b),
the procedure X2 has two call-sites along the loop through b. In Eq. (7), each
right-hand-side summand in the equation for $Y_2$ has at most one variable: the
transformation inserted $\nu_2$ or $\nu_3$ at various call-sites (considering $X_{s_2,b}$ as a
pseudo-call-site corresponding to tail recursion), and left at most one variable $Y_j$
in each summand. In essence, during a given Newton round, the analyzer samples
the behavior of f by taking the + of various paths through the transformation
of f. Along each path through a (transformed) right-hand side, the summary for
each pseudo-call-site $X_i$ encountered is held fixed at $\nu_i$, except for possibly one
pseudo-call-site on the path, which is explored by visiting (the linearized version
of) the called procedure. The summaries $\nu_1, \nu_2, \nu_3$ are updated according to the
result of this exploration, and the algorithm performs the next Newton round.

The analogy between NPA and Newton’s method in numerical analysis is
that in both cases one creates a linear approximation of f(X) around the “point”
(ν, f(ν)); the solution of the linear system is the next approximation of X.
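The Newton rounds described in this section can be tried out on a small finite model. The sketch below is an illustration under several assumptions, not the NPA implementation of [26] or [53]: relations over a four-element state space stand in for the semiring, the five edge relations are drawn at random, each linearized system is solved by plain Kleene iteration (which terminates because the model is finite), and the term f(ν) is added explicitly as in Eq. (5).

```python
import random

STATES = range(4)  # hypothetical finite state space
EMPTY = frozenset()


def dot(*rels):
    """Relational composition of one or more relations, left to right."""
    out = rels[0]
    for r in rels[1:]:
        out = frozenset((s, u) for (s, t) in out for (t2, u) in r if t == t2)
    return out


def lfp(step):
    """Least fixed point of a monotone map on triples, with the round count."""
    v, rounds = (EMPTY, EMPTY, EMPTY), 0
    while step(v) != v:
        v, rounds = step(v), rounds + 1
    return v, rounds


def make_system(a, b, d, x, e):
    """Eq. (6), with a=(s1,a), b=(s2,b), d=(d,b), x=(s2,x2), e=(b,x2)."""
    def f(v):
        _, v2, v3 = v
        return (dot(a, v3), b | dot(v2, v3, v3, d), x | dot(v2, e))

    def linearized(nu):
        """Eq. (5) at nu: f(nu) plus the summands with one variable kept."""
        _, n2, n3 = nu
        base = f(nu)

        def g(y):
            _, y2, y3 = y
            return (
                base[0] | dot(a, y3),
                base[1] | dot(y2, n3, n3, d) | dot(n2, y3, n3, d) | dot(n2, n3, y3, d),
                base[2] | dot(y2, e),
            )
        return g

    return f, linearized


def newton(f, linearized):
    """Newton rounds: start at f(0), then repeatedly solve the linear system."""
    nu, rounds = f((EMPTY, EMPTY, EMPTY)), 0
    while True:
        y, _ = lfp(linearized(nu))
        if y == nu:
            return nu, rounds
        nu, rounds = y, rounds + 1


def random_relation(rng, density=0.3):
    """A random relation on STATES; `rng` is a random.Random instance."""
    return frozenset((s, t) for s in STATES for t in STATES if rng.random() < density)
```

In this finite setting, `newton` and `lfp(f)` reach the same least solution, and the number of Newton rounds does not exceed the number of Kleene rounds, although each Newton round does more work.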


4.2 Algebraic Program Analysis for Linear Equations 


In this section, we instantiate the recipe for algebraic program analysis from 
Sect. 3.3 to solve a system of linear equations, such as the linearized problems 
that arise as Eq. (5) [53]. This goal may seem out of reach because item 2 of
the recipe requires us to “design a suitable language of ‘closed-form solutions’ 
and an algorithm for computing them.” 

What is a suitable language of closed-form solutions of linear equations?
Clearly the regular expressions and path-expression algorithms used in Sect. 2
and Sect. 3 will not do, because the least solution under the language interpretation
to the (truly) linear equation X = aXb + 1 is $\{a^i b^i : i \ge 0\}$, which is the
canonical example of a linear-context-free language that is not regular. However,
over fifty years ago, formal-language theorists established that linear-context-free
languages have certain similarities to regular languages [17,34,61], and we
can make use of this property to design a language of closed forms for linear
equations. Intuitively, $\{a^i b^i : i \ge 0\}$ can be obtained by (i) introducing paired
alphabet symbols, such as (a, b), (ii) defining concatenation of paired symbols as
(a, b) · (c, d) = (ca, bd), (iii) defining Kleene-star in the natural way over paired-symbol
concatenation, so (a, b)* is the language of paired words $\{(a^i, b^i) : i \ge 0\}$,
and (iv) applying an operation that concatenates the left word and right word
of each paired word: $\{(a^i, b^i) : i \ge 0\} \mapsto \{a^i b^i : i \ge 0\}$.

For the purpose of algebraic program analysis, this idea can be formalized by
introducing tensored regular expressions over an alphabet Σ, whose syntax
is defined as follows:⁴

\[
\begin{array}{rcl}
 & & a \in \Sigma\\
R \in \mathrm{RegExp}(\Sigma) &::=& a \mid 0 \mid 1 \mid R_1 + R_2 \mid R_1 \cdot R_2 \mid R^* \mid S^{\lightning}\\
S \in \mathrm{RegExp}_T(\Sigma) &::=& R_1 \otimes R_2 \mid 0 \mid 1 \mid S_1 \oplus S_2 \mid S_1 \odot S_2 \mid S^{\circledast}
\end{array}
\]


We can now follow the pattern of Sect. 2, and define algebras suitable for 
interpreting tensored regular expressions. 


Definition 3. A tensor-product algebra $\mathscr{T} = (\mathbf{A}, \mathbf{T}, \otimes, \lightning)$ consists of two
regular algebras A and T along with an operation $\otimes : A \times A \to T$, called tensor
product, and an operation $\lightning : T \to A$, called detensor.


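Example 9 (Standard interpretation). The standard interpretation from Example 1
can be extended to tensored regular expressions by defining a universe of
languages over word pairs (“tensored words”) $T = 2^{\Sigma^* \times \Sigma^*}$ whose operators are
given by:

\[
\begin{array}{rcl}
X \otimes Y &\triangleq& \{(x, y) : x \in X, y \in Y\}\\
Z^{\lightning} &\triangleq& \{z z' : (z, z') \in Z\}\\
Z_1 \odot Z_2 &\triangleq& \{(z_2 z_1, z_1' z_2') : (z_1, z_1') \in Z_1, (z_2, z_2') \in Z_2\}\\
Z_1 \oplus Z_2 &\triangleq& Z_1 \cup Z_2\\
Z^{\circledast} &\triangleq& \bigcup_{i \in \mathbb{N}} \underbrace{Z \odot \cdots \odot Z}_{i \text{ times}}
\end{array}
\]

Note that this interpretation allows tensored regular expressions to be used to
capture linear context-free languages. For instance, the equation X = aXb + 1,
whose least solution is $\{a^i b^i : i \ge 0\}$, can be written in closed form as
$X = ((a \otimes b)^{\circledast})^{\lightning}$, and the equation X = aXa + bXb + 1, whose least solution is
the language of even-length palindromes over {a, b}, can be written as
$X = (((a \otimes a) \oplus (b \otimes b))^{\circledast})^{\lightning}$. ⌟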
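The operators on tensored words can be sketched with Python sets of string pairs. Since the tensored star denotes an infinite union, the sketch below truncates it at a fixed depth; that bound and the function names are assumptions made for illustration only.

```python
def tensor(xs, ys):
    """Tensor product: all pairs of a word from xs with a word from ys."""
    return {(x, y) for x in xs for y in ys}


def detensor(zs):
    """Concatenate the left and right word of every pair."""
    return {left + right for (left, right) in zs}


def odot(z1, z2):
    """Paired concatenation: (a, b) . (c, d) = (ca, bd)."""
    return {(c + a, b + d) for (a, b) in z1 for (c, d) in z2}


def oplus(z1, z2):
    """Tensored choice is union."""
    return z1 | z2


def ostar(zs, depth):
    """Bounded tensored star: i-fold products of zs for i = 0..depth."""
    power = {("", "")}
    result = set(power)
    for _ in range(depth):
        power = odot(power, zs)
        result |= power
    return result


# Closed forms from the example, unrolled to a small depth.
AB = detensor(ostar(tensor({"a"}, {"b"}), 3))
PALINDROMES = detensor(
    ostar(oplus(tensor({"a"}, {"a"}), tensor({"b"}, {"b"})), 2)
)
```

Here `AB` contains the words a^i b^i for i up to 3, and `PALINDROMES` contains the even-length palindromes over {a, b} of length at most 4.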


Example 10 (Relational interpretation). The relational interpretation can be
extended to tensored regular expressions by defining an algebra of binary state-pair
relations, as follows. The universe is the set of relations on State × State⁵
(i.e., an element of the universe is a subset of State × State × State × State).
Comparing with the standard interpretation (in which an element (p₁, p₂) consists
of a “backwards path” p₁ and a “forwards continuation” p₂), we may think of

4 A warning about notation: in our previous papers, we used @ and ® for the two 
semiring operations, © for tensor product, and @7 and ®7 for the two tensored- 
semiring operations. In this paper, we use + and - for the semiring operations, with 
circles around them for the tensored-semiring versions: 6 and ©. We use ® for tensor 
product, which is consistent with usual mathematical notation. 

5 That is, an element of the algebra is a pair of pairs of states. 
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A 
an element 2) ; o ) of a state-pair relation as consisting of two pre/post 
2 2 


state pairs: a “backwards” pair si *— sı and a “forwards” pair s2 >* sh. In the 
algebra of state-pair relations, 0 is interpreted as the empty relation, 1 as the 
identity relation, and + as union. The remaining operators are given by: 


non- (G) mad eminem] 
Tt = fis PRERA ( (2) s G) ETA = r (8) 
OOIEOE 
T 


T®e=|]JTo...OT 


i times 


Note that the tensored sequencing operation is just a form of relational compo- 
sition (over tuples of stacked elements); similarly, tensored iteration is a form of 
reflexive transitive closure. J 


Example 11 (Transition-formula interpretation). Transition formulas can be
used to interpret tensored regular expressions in a way analogous to the
relational interpretation (as one should expect, because there must be a
soundness relation between them!). A tensored transition formula T is a formula
over four vocabularies, representing the value of the variables before and after
a pair of computations. The tensor and detensor operations are essentially the
same as those from the relational interpretation, translated into logic:

$$
(F_1 \otimes F_2)\left(\tbinom{X_1'}{X_2}, \tbinom{X_1}{X_2'}\right) \triangleq F_1(X_1, X_1') \wedge F_2(X_2, X_2') \qquad (9)
$$

$$
T^{\lightning}(X, X') \triangleq \exists Y_1, Y_2.\ T\left(\tbinom{Y_1}{Y_2}, \tbinom{X}{X'}\right) \wedge Y_1 = Y_2
$$

In Eq. (9), the vocabularies $X_1$, $X_1'$, $X_2$, and $X_2'$ track the original
role of the respective vocabulary in $F_1$ or $F_2$. The “stacked” notation is
intended to be suggestive of an interpretation of a tensored transition formula
over a doubled vocabulary, where the variables are $X_1' \cup X_2$ and their
“primed copies” are $X_1 \cup X_2'$. To make the connection with Sect. 2.1 more
apparent, we shall define $W_1 = X_1'$, $W_2 = X_2$, $W_1' = X_1$, $W_2' = X_2'$.
With this notation, the product operation can be defined as:

$$
(T_1 \odot T_2)\left(\tbinom{W_1}{W_2}, \tbinom{W_1'}{W_2'}\right) \triangleq \exists W_1'', W_2''.\ T_1\left(\tbinom{W_1}{W_2}, \tbinom{W_1''}{W_2''}\right) \wedge T_2\left(\tbinom{W_1''}{W_2''}, \tbinom{W_1'}{W_2'}\right)
$$

As with the relational interpretation, the product operation is just a form of
relational composition (over tuples of stacked elements).

Remarkably, the algebra of tensored transition formulas is the same as the
algebra of untensored transition formulas, just over an extended set of variables.
In particular, the iteration operators from Sect. 3 can be used to implement ⊛.
For instance, consider the recursive procedure


foo(): if (*) then a := a + 1; foo(); b := b + 1


The path to the recursive call of foo and the path from the recursive call to exit
can be modeled by the transition formulas F and G, respectively:

$$
\begin{aligned}
F &\triangleq a' = a + 1 \wedge b' = b \\
G &\triangleq b' = b + 1 \wedge a' = a
\end{aligned}
$$

A procedure summary for foo can be calculated by evaluating
$((F \otimes G)^{\circledast})^{\lightning}$, using recurrence analysis (Example 5)
to implement the ⊛ operator:

$$
\begin{aligned}
F \otimes G &\triangleq a_1' = a_1 - 1 \wedge b_1' = b_1 \wedge b_2' = b_2 + 1 \wedge a_2' = a_2 \\
(F \otimes G)^{\circledast} &\triangleq \exists k.\ k \geq 0 \wedge a_1' = a_1 - k \wedge b_1' = b_1 \wedge b_2' = b_2 + k \wedge a_2' = a_2 \\
((F \otimes G)^{\circledast})^{\lightning} &\triangleq \exists k.\ k \geq 0 \wedge a' = a + k \wedge b' = b + k
\end{aligned}
$$
⌟
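The summary for foo can be sanity-checked by running the procedure directly. The sketch below is ours, not the paper's: it replaces the nondeterministic choice `(*)` by an explicit `depth` argument (an assumption made so that the code is deterministic) and checks that every run satisfies ∃k. k ≥ 0 ∧ a' = a + k ∧ b' = b + k.

```python
def foo(a, b, depth):
    """Run foo with the nondeterministic branch (*) taken `depth` times."""
    if depth > 0:
        a = a + 1                    # path F: a' = a + 1, b' = b
        a, b = foo(a, b, depth - 1)  # recursive call
        b = b + 1                    # path G: b' = b + 1, a' = a
    return a, b

def summary_holds(a, b, a_post, b_post):
    """Check: exists k >= 0 with a' = a + k and b' = b + k."""
    k = a_post - a                   # the only possible witness for k
    return k >= 0 and b_post == b + k

all_runs_covered = all(
    summary_holds(a, b, *foo(a, b, depth))
    for a in range(-3, 4)
    for b in range(-3, 4)
    for depth in range(6)
)
```

A run that takes the recursive branch `depth` times corresponds to the witness k = depth in the summary.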


We now show how to compute closed forms for linear equations. First, we
perform a regularizing transformation, which takes a system of linear equations
$E_{Lin}$ and converts it into a system of left-linear equations $E_{LeftLin}$.
The transformation takes each right-hand-side term of the form a · Y · b and
converts it to Z ⊙ (a ⊗ b), where Y and Z are variables whose values are elements
of the regular algebras A and T of a tensor-product algebra (A, T, ⊗, ↯).


Definition 4. Given a linear equation system $E_{Lin}$ over the regular algebra A
of a tensor-product algebra T = (A, T, ⊗, ↯), the regularizing transformation
$\tau_{Reg}$ creates a left-linear equation system
$E_{LeftLin} = \tau_{Reg}(E_{Lin})$ over T by transforming each equation of
$E_{Lin}$ as follows:

$$
Y_j = c_j + \sum_{i,k} \left(a_{i,j,k} \cdot Y_i \cdot b_{i,j,k}\right)
\quad \overset{\tau_{Reg}}{\Longrightarrow} \quad
Z_j = (1 \otimes c_j) \oplus \bigoplus_{i,k} \left(Z_i \odot (a_{i,j,k} \otimes b_{i,j,k})\right)
$$

where $Z_i$ and $Z_j$ are variables that take on values from T.


For instance, if the regularizing transformation is applied to the linear system
of equations in Fig. 6a, the result is the system of equations in Fig. 6b. Because
Fig. 6b is left-linear, we can now use the approach from Sect. 2 and Sect. 3: that
is, create a closed-form solution for each variable $Z_i$ by finding a path
expression for the variable in the graph in Fig. 6c. Finally, one gives a
closed-form solution for each variable $Y_i$ of the linear equation system in
Fig. 6a by applying $(-)^{\lightning}$ to each path expression (see Fig. 6d). This
algorithm for computing closed-form solutions to linear equations is justified in
the tensored-relational interpretation, and more generally, in any interpretation
whose algebra forms what we dub a Kronecker algebra, defined as follows:


(a)  $Y_1 = a Y_2 b + c Y_2 d$
     $Y_2 = e + f Y_1 g$

(b)  $Z_1 = (Z_2 \odot (a \otimes b)) \oplus (Z_2 \odot (c \otimes d))$
     $Z_2 = (1 \otimes e) \oplus (Z_1 \odot (f \otimes g))$

(c)  [Graph over the nodes $Z_1$ and $Z_2$: an entry edge into $Z_2$ labeled
     (1 ⊗ e); two edges from $Z_2$ to $Z_1$ labeled (a ⊗ b) and (c ⊗ d); an edge
     from $Z_1$ to $Z_2$ labeled (f ⊗ g).]

(d)  $Y_2 = ((1 \otimes e) \odot (((a \otimes b) \oplus (c \otimes d)) \odot (f \otimes g))^{\circledast})^{\lightning}$
     [The closed form for $Y_1$ is not legible in the source.]


Fig. 6. (a) A linear system of equations; (b) its regularization; (c) the graph
corresponding to (b); (d) a closed-form solution for (a).
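The closed form for Y2 in Fig. 6d can be compared against a direct fixpoint computation in the standard (language) interpretation. This is a sketch under our own assumptions: the letters a to g stand for single-character words, and all languages are truncated to words of length at most `MAX_LEN` so that the computation is finite.

```python
MAX_LEN = 11  # assumption: only words up to this length are compared

def cat(X, Y):
    """Language concatenation, truncated to MAX_LEN."""
    return {x + y for x in X for y in Y if len(x + y) <= MAX_LEN}

def tensor(X, Y):
    """X ⊗ Y: all pairs of words."""
    return {(x, y) for x in X for y in Y}

def ttimes(Z1, Z2):
    """Z1 ⊙ Z2, truncated by the combined length of the pair."""
    return {(l2 + l1, r1 + r2) for (l1, r1) in Z1 for (l2, r2) in Z2
            if len(l2 + l1 + r1 + r2) <= MAX_LEN}

def tstar(Z):
    """Z^⊛ up to the truncation bound."""
    result, frontier = {("", "")}, {("", "")}
    while frontier:
        frontier = ttimes(frontier, Z) - result
        result |= frontier
    return result

def detensor(Z):
    """Z^↯: concatenate both halves of each pair."""
    return {left + right for (left, right) in Z}

def least_solution():
    """Kleene iteration on the linear system of Fig. 6a (truncated)."""
    y1, y2 = set(), set()
    while True:
        n1 = cat(cat({"a"}, y2), {"b"}) | cat(cat({"c"}, y2), {"d"})
        n2 = {"e"} | cat(cat({"f"}, y1), {"g"})
        if (n1, n2) == (y1, y2):
            return y1, y2
        y1, y2 = n1, n2

A, B, C, D, E, F, G = ({ch} for ch in "abcdefg")
# Y2 = ((1 ⊗ e) ⊙ (((a ⊗ b) ⊕ (c ⊗ d)) ⊙ (f ⊗ g))^⊛)^↯
loop = ttimes(tensor(A, B) | tensor(C, D), tensor(F, G))
closed_y2 = detensor(ttimes(tensor({""}, E), tstar(loop)))
y1, y2 = least_solution()
```

Up to the truncation bound, `closed_y2` and the iteratively computed `y2` contain the same words, such as `e`, `faebg`, and `fcedg`.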


Definition 5. A Kronecker algebra K = ((A, +, ·, *, 0, 1), (T, ⊕, ⊙, ⊛, 0, 1),
⊗, ↯) is a tensor-product algebra that consists of two Kleene algebras
(A, +, ·, *, 0, 1) and (T, ⊕, ⊙, ⊛, 0, 1) such that (i) the natural order forms a
complete lattice (i.e., both algebras have all infinite sums), and (ii) the
following properties hold:

1. $0 \otimes 0 = 0$
2. $1 \otimes 1 = 1$
3. $(a \otimes b)^{\lightning} = a \cdot b$, for all $a, b \in A$
4. $(a_1 \otimes b_1) \odot (a_2 \otimes b_2) = (a_2 \cdot a_1) \otimes (b_1 \cdot b_2)$, for all $a_1, a_2, b_1, b_2 \in A$
5. $(t_1 \oplus t_2)^{\lightning} = t_1^{\lightning} + t_2^{\lightning}$, for all $t_1, t_2 \in T$

We assume that all distributivity properties of A and T, as well as item 5, hold
for infinite sums. In particular, for item 5, we have

$$
\left(\bigoplus_{i \in I} t_i\right)^{\lightning} = \sum_{i \in I} t_i^{\lightning}
$$
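Items 3 to 5 can be checked mechanically for the relational interpretation on a small state space. The following sketch is ours: it encodes a tensored element as a pair (pre-pair, post-pair) with pre-pair (s1', s2) and post-pair (s1, s2'), following the stacked convention of Example 10, and tests the three properties on randomly generated relations (the state space size and the random seed are arbitrary assumptions).

```python
import random

STATES = range(3)

def compose(r1, r2):
    """r1 · r2: relational composition (first r1, then r2)."""
    return {(s, u) for (s, t) in r1 for (t2, u) in r2 if t == t2}

def tensor(r1, r2):
    """r1 ⊗ r2 with pre-pair (s1', s2) and post-pair (s1, s2')."""
    return {((s1p, s2), (s1, s2p)) for (s1, s1p) in r1 for (s2, s2p) in r2}

def detensor(t):
    """T^↯: keep elements whose pre-pair is (t, t); return the post-pair."""
    return {post for (pre, post) in t if pre[0] == pre[1]}

def ttimes(t1, t2):
    """T1 ⊙ T2: relational composition over stacked pairs."""
    return {(u, w) for (u, v) in t1 for (v2, w) in t2 if v == v2}

def random_relation(rng):
    return {(s, t) for s in STATES for t in STATES if rng.random() < 0.4}

rng = random.Random(0)
axioms_hold = True
for _ in range(200):
    a1, a2, b1, b2 = (random_relation(rng) for _ in range(4))
    t1, t2 = tensor(a1, b1), tensor(a2, b2)
    # item 3: (a ⊗ b)^↯ = a · b
    axioms_hold &= detensor(t1) == compose(a1, b1)
    # item 4: (a1 ⊗ b1) ⊙ (a2 ⊗ b2) = (a2 · a1) ⊗ (b1 · b2)
    axioms_hold &= ttimes(t1, t2) == tensor(compose(a2, a1), compose(b1, b2))
    # item 5: (t1 ⊕ t2)^↯ = t1^↯ + t2^↯
    axioms_hold &= detensor(t1 | t2) == (detensor(t1) | detensor(t2))
```

Random testing of this kind is only evidence, not a proof; it is mainly useful for catching a mistaken encoding of the stacked pairs.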


4.3 Discussion 


The Instantiation of the Recipe. Returning to the recipe from Sect. 3.3, what we
have done for a system of linear equations $E_{Lin}$ is to instantiate the recipe
as follows:


1. (Modeling). The concrete semantics is the least solution of $E_{Lin}$
interpreted in relational semantics.

2. (Closed forms). Each variable of $E_{Lin}$ is expressed as the detensor
($(-)^{\lightning}$) of a tensored regular expression. Closed forms are computed
from the closed forms of the left-linear system of equations
$\tau_{Reg}(E_{Lin})$ that results from the regularizing transformation (e.g.,
see Fig. 6).

3. (Interpretation). Tensored regular expressions can be interpreted as tensored
transition formulas (Example 11), which are simply transition formulas over
a “doubled” vocabulary.


Two Lessons. We would like to mention two lessons that we learned while work- 
ing on this material over the years. 


1. For the problems that arise in NPA, we must solve an equation system that 
is truly linear, not left-linear or right-linear. A reasonable sanity check might 
go as follows: 

— Algebraic program analysis à la Sect. 2 solves a left-linear (or right-linear) 
system of equations using methods based on regular expressions. 

— NPA repeatedly creates a system of linear equations that needs to be 
solved. Such linear equations are related to linear context-free languages, 
such as the language $\{a^i b^i\}$, which is not regular.

— Ergo, it is a non-starter to attempt to apply algebraic program analysis 
to the equations that arise on each round of NPA. 

However, as shown in this section, it was possible to side-step this fundamen- 
tal mismatch, by extending algebraic program analysis to systems of linear 
equations using Kronecker algebras, which have additional operations, such 
as tensor product and detensor. 

Thus, beyond the technical details, perhaps a more important takeaway is “be 
careful how you apply sanity checks.” There is a risk that a plausible-sounding 
sanity check could cause you to discard an idea that is worth pursuing. 

2. In some sense, the solution using Kronecker algebras goes against the grain of
what computer scientists typically preach, namely, create appropriate
abstractions (in the sense of abstract data-types) for a problem at hand, and then
program your solution, thinking of the chosen abstractions as the operations
of an abstract machine. This style of thinking is considered central to
managing complexity in computer science, and it is generally considered heresy
to break an abstraction.
For algebraic program analysis, the abstraction is regular algebra, used with
interpretations that are abstractions (in the sense of abstract interpretation
[22]) of a program’s concrete transition relations. However, the introduction
of tensor product and detensor breaks that abstraction! To understand what
we mean, consider the definition of F · G for transition relations in Boolean
programs, i.e.,

$$
(F \cdot G)(W, Z) \triangleq \exists X, Y.\ F(W, X) \wedge G(Y, Z) \wedge (X = Y),
$$

and the definitions of F ⊗ G and $T^{\lightning}$,⁶ namely,

$$
\begin{aligned}
(F \otimes G)(W, X, Y, Z) &\triangleq F(W, X) \wedge G(Y, Z) \\
T(W, X, Y, Z)^{\lightning} &\triangleq \exists X, Y.\ T(W, X, Y, Z) \wedge (X = Y)
\end{aligned}
$$

The product operation F · G has three distinct steps: (i) conjoin F(W, X) and
G(Y, Z); (ii) conjoin the equality X = Y; and (iii) project out vocabularies X
and Y. In essence, tensor product and detensor break the abstraction of · as
an indivisible operation: · is decomposed into two more-granular operations,
⊗ and ↯. By performing F ⊗ G, we perform just the first step of ·, and only
later, when ↯ is performed, do we “finish up” by applying the second and
third steps of ·. The advantage is that we can operate on tensored values for
some number of steps before “finishing” some earlier ·.
Again, beyond the technical details, the takeaway may be the process that we
went through, which may be of value as a conceptual tool in other contexts:
— The insight on how to break the abstraction (both as presented here
and as occurred during our research seven or eight years ago) came from
thinking about one specific interpretation of Kleene algebra: transition
relations for Boolean programs.
— The algebraic properties of the new, finer-granularity operations allowed
us to abstract out a new algebra, dubbed in this paper Kronecker algebra.
— The ideas could now be applied in other contexts by finding other
interpretations of Kronecker algebra (or, because we are interested in program
analysis, by finding interpretations that over-approximate Kronecker
algebra).


6 Because we are trying to relate these operations to the untensored product
operation ·, we do not make use of the stacked notation from Sect. 4.2.


5 Termination Analysis 


This section describes how algebraic program analysis can be applied to termi- 
nation analysis, based on the approach of [63]. The goal of termination analysis 
is to prove that a program has no infinite executions. Our high-level strategy is 
to exploit compositionality: we prove that a loop terminates by first computing a 
summary (e.g., a transition formula) for its body, and then finding a termination 
argument for the summary. 

Following Sect. 3, we first formalize a concrete semantics as the (greatest)
solution of a system of semantic equations. An appropriate notion of concrete
semantics for termination analysis is the set of non-terminating states of the
program (from which there exists an infinite execution): the program terminates
exactly when none of the program’s initial states belong to this set. As in
Sect. 3, this system of equations can be derived syntactically from a program’s
control flow graph (see Fig. 7 for an example). The non-terminating states of the
program are the greatest solution to this system of equations over the algebra
where the universe is the set of states, ⊞ is interpreted as union (a state is
non-terminating if it has at least one infinite execution) and ⊡ is interpreted as
preimage (a state is non-terminating iff it can reach a non-terminating state).⁷


7 Despite the fact that this system of equations is right-linear, the method of
Sect. 2 does not apply because the system of equations has two sorts instead of
one; in particular, ⊡ has type
$\boxdot : 2^{State \times State} \times 2^{State} \to 2^{State}$ and so is not a
binary operation on a set.


(a)  [Control flow graph over the nodes r, a, b, c, d, e: the outer loop is the
     cycle r → a → b → c → r, and the inner loop is the cycle c → d → e → c.]

(b)  while (i < n ∧ j > 0) do
       k := nondet()
       j := j + k - 1
       while (k > 0) do
         k := k - 1
         i := i + 1

(c)  $X_r = (r, a) \boxdot X_a$
     $X_a = (a, b) \boxdot X_b$
     $X_b = (b, c) \boxdot X_c$
     $X_c = ((c, r) \boxdot X_r) \boxplus ((c, d) \boxdot X_d)$
     $X_d = (d, e) \boxdot X_e$
     $X_e = (e, c) \boxdot X_c$


Fig. 7. A program represented by a control flow graph (a), abstract syntax tree (b),
and system of equations (c).

A suitable language of “closed-form solutions” for the systems of equations
that arise in termination analysis is ω-regular expressions. The syntax of
ω-regular expressions over an alphabet Σ is as follows:

$$
\begin{aligned}
a &\in \Sigma \\
R \in \mathrm{RegExp}(\Sigma) &::= a \mid 0 \mid 1 \mid R_1 + R_2 \mid R_1 \cdot R_2 \mid R^* \\
S \in \omega\text{-}\mathrm{RegExp}(\Sigma) &::= R^{\omega} \mid S_1 \boxplus S_2 \mid R \boxdot S
\end{aligned}
$$

The semantics of (ω-)regular expressions is given by an interpretation over an
ω-algebra and a regular algebra.


Definition 6. An ω-algebra over a regular algebra A is a 4-tuple
$B = \langle B, \boxdot^{B}, \boxplus^{B}, \omega^{B} \rangle$ consisting of a
universe B, an operation $\boxdot^{B} : A \times B \to B$, an operation
$\boxplus^{B} : B \times B \to B$, and an operation $(-)^{\omega^{B}} : A \to B$.


Example 12 (Standard interpretation). In the standard interpretation of
ω-regular expressions, the universe consists of sets of infinite sequences over
the alphabet Σ, and the operations are

$$
\begin{aligned}
W_1 \boxplus W_2 &\triangleq W_1 \cup W_2 && \text{Union} \\
X \boxdot W &\triangleq \{x w : x \in X,\ w \in W\} && \text{Concatenation} \\
X^{\omega} &\triangleq \{x_1 x_2 x_3 \cdots : x_1, x_2, \ldots \in X\} && \text{Infinite repetition}
\end{aligned}
$$

For example, an ω-regular expression that recognizes all infinite paths in Fig. 7a
starting at r is:

$$
\begin{aligned}
&\underbrace{\big((r,a)(a,b)(b,c)\big((c,d)(d,e)(e,c)\big)^{*}(c,r)\big)^{\omega}}_{\text{Outer loop}} \\
{}\boxplus{} &\Big(\big((r,a)(a,b)(b,c)\big((c,d)(d,e)(e,c)\big)^{*}(c,r)\big)^{*}(r,a)(a,b)(b,c)\Big) \boxdot \underbrace{\big((c,d)(d,e)(e,c)\big)^{\omega}}_{\text{Inner loop}}
\end{aligned}
$$
⌟

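Example 13 (Nonterminating state interpretation). The non-terminating state
algebra is an ω-algebra over the algebra of state relations. Its universe consists
of sets of states. The operators are

$$
\begin{aligned}
R \boxdot S &\triangleq \{x : \exists y.\ (x, y) \in R \wedge y \in S\} && \text{Preimage} \\
S_1 \boxplus S_2 &\triangleq S_1 \cup S_2 && \text{Union} \\
R^{\omega} &\triangleq \{x_0 \in State : \exists x_1, x_2, \ldots\ \forall i.\ (x_i, x_{i+1}) \in R\} && \text{Non-terminating states of } R
\end{aligned}
$$
⌟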


Tarjan’s path expression algorithm can be adapted to compute an w-regular 
expression that recognizes the set of infinite paths in a graph beginning at a 
particular node [63]. The equational view of this algorithm is that it computes 
closed-form solutions to right-linear equations over Büchi algebras (e.g., the
algebra of non-terminating states).


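Definition 7 (Büchi algebra). A Büchi algebra is an ω-algebra over a Kleene
algebra satisfying the following:

$$
\begin{aligned}
S_1 \boxplus (S_2 \boxplus S_3) &= (S_1 \boxplus S_2) \boxplus S_3 && \text{Associativity} \\
S_1 \boxplus S_2 &= S_2 \boxplus S_1 && \text{Commutativity} \\
S \boxplus S &= S && \text{Idempotence} \\
(R_1 \cdot R_2) \boxdot S &= R_1 \boxdot (R_2 \boxdot S) && \text{Compatibility} \\
(R_1 + R_2) \boxdot S &= (R_1 \boxdot S) \boxplus (R_2 \boxdot S) && \text{Right-distributivity} \\
R \boxdot (S_1 \boxplus S_2) &= (R \boxdot S_1) \boxplus (R \boxdot S_2) && \text{Left-distributivity} \\
R^{\omega} &= R \boxdot R^{\omega} && \text{Unfold} \\
S_1 \leq (R \boxdot S_1) \boxplus S_2 &\implies S_1 \leq R^{\omega} \boxplus (R^{*} \boxdot S_2) && \text{Coinduction}
\end{aligned}
$$

where ≤ is the order defined by a ≤ b iff a ⊞ b = b.


Exercise 2. Show that in any Büchi algebra, the greatest solution to the equation
$X = (a \boxdot X) \boxplus z$ exists and is equal to
$X = a^{\omega} \boxplus (a^{*} \boxdot z)$.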
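For a finite state space, the non-terminating state algebra and the closed form from Exercise 2 can be computed directly. The sketch below is illustrative only: the transition relation `a`, the set `z`, and the seven-state universe are made-up inputs, and the greatest fixpoints are computed by iterating downward from the set of all states.

```python
def preimage(r, s):
    """R ⊡ S: states with an R-successor in S."""
    return {x for (x, y) in r if y in s}

def star(r, states):
    """R*: reflexive transitive closure of R."""
    closure = {(s, s) for s in states} | set(r)
    while True:
        step = closure | {(x, z) for (x, y) in closure
                          for (y2, z) in r if y == y2}
        if step == closure:
            return closure
        closure = step

def omega(r, states):
    """R^ω: greatest fixpoint of S = R ⊡ S (states with an infinite R-path)."""
    s = set(states)
    while True:
        nxt = preimage(r, s)
        if nxt == s:
            return s
        s = nxt

def greatest_solution(a, z, states):
    """Greatest solution of X = (a ⊡ X) ⊞ z, iterating down from all states."""
    x = set(states)
    while True:
        nxt = preimage(a, x) | z
        if nxt == x:
            return x
        x = nxt

states = range(7)
a = {(0, 1), (1, 0), (2, 0), (3, 4), (4, 5)}   # 0 and 1 form a cycle
z = {5}
closed_form = omega(a, states) | preimage(star(a, states), z)
```

Here states 0, 1, and 2 have infinite paths, states 3, 4, and 5 can reach `z`, and the isolated state 6 belongs to neither set.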


Summarizing: we have modeled a program’s non-terminating states as the
greatest solution to a system of semantic equations, devised a language of
“closed-form solutions”, and identified an algorithm for computing closed-form
solutions to the equations. It remains only to develop abstract interpretations
of the language of closed forms that implement termination analysis.

5.1 Non-terminating State-Formula Interpretations


Just as transition formulas (over variables X and X’) can be used to represent 
state relations, state formulas (over the variables X) can be used to represent 
sets of (non-terminating) states. We can extend an algebra of transition formulas 
to an algebra of non-terminating state formulas by defining 


$$
\begin{aligned}
F \boxdot P &\triangleq \exists X'.\ F(X, X') \wedge P(X') && \text{Preimage} \\
P_1 \boxplus P_2 &\triangleq P_1 \vee P_2 && \text{Union}
\end{aligned}
$$

Intuitively, the ω operator should compute the set of non-terminating states
of a transition formula. Analogously to the * operator in Sect. 2, this set is
uncomputable, and we must be satisfied with an over-approximation (i.e., we
aim to compute a state formula that contains all non-terminating states; the
soundness relation of interest is the one defined by
$N \Vdash S \iff \forall s \in N.\ s \models S$).
There are many ways of doing this, so we speak of the family of non-terminating
state formula interpretations. In the remainder of this section, we give examples
of ω-operators.


Example 14 (Linear-lexicographic ranking functions [32]). Let F(X, X') be a
transition formula. A linear lexicographic ranking function (LLRF) for F is a
sequence of linear terms $t_1, \ldots, t_n$ over X such that for any states s and
s' such that $s \to_F s'$, each $t_i$ evaluates to a non-negative integer in s,
and the integer n-tuple decreases in lexicographic order going from s to s'.
Since there are no infinite strictly descending chains of non-negative n-tuples
of integers with respect to the lexicographic order, if F has an LLRF, then F has
no non-terminating states. For example, the inner loop of Fig. 7 has a
1-dimensional LLRF (k), and the outer loop has a 2-dimensional LLRF (n − i, j).

The problem of determining whether a linear integer arithmetic formula has
an LLRF is decidable [32]. If a formula does not have an LLRF, then we can use
a coarse over-approximation of the non-terminating states of a formula (e.g., the
set of states that have at least one outgoing transition). This yields the
following interpretation of the ω operator:

$$
F^{\omega} \triangleq
\begin{cases}
\mathit{false} & \text{if there is an LLRF for } F \\
\exists X'.\ F(X, X') & \text{otherwise}
\end{cases}
$$

For Fig. 7, using recurrence analysis to implement the * operator (Example 5), we
get that every non-terminating state must satisfy false; that is, the program
terminates from any initial state. ⌟


Example 15 (Unbounded trajectories [63]). Let F(X, X') be a transition formula.
A necessary (but not sufficient) condition for a state s to be non-terminating
for a transition formula F is that there is a computation of F starting from s of
every possible length. This condition is undecidable, but it can be approximated
using an approximate transitive-closure operator such as the ones in Sect. 2.1.
Suppose that $(-)^{*}$ is an over-approximate transitive-closure operator.
Letting k and k' be symbols that do not appear in F, we can create a transition
formula exp(F) in one parameter k' such that for any k', if there exists a
sequence $s_1 \to_F s_2 \to_F \cdots \to_F s_{k'}$, then
$s_1 \to_{\exp(F)} s_{k'}$:

$$
\exp(F) \triangleq (F \wedge k' = k + 1)^{*}[k \mapsto 0]
$$

The set of states s for which there exists a computation
$s \to_{\exp(F)} s' \to_F s''$ for all choices of the parameter k'
over-approximates the set of non-terminating states of F:

$$
F^{\omega} \triangleq \forall k' \geq 0.\ \exists X', X''.\ \exp(F) \wedge F(X', X'')
$$

For example, if * is instantiated to recurrence analysis (Example 5), then on
the transition formula

$$
F \triangleq i \neq n \wedge i' = i + 2 \wedge n' = n
$$

(corresponding to the program while (i ≠ n) do i := i + 2), we have

$$
F^{\omega} = i > n \vee (n - i) \bmod 2 = 1
$$
⌟
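The formula obtained for this loop can be compared with brute-force simulation on a small grid of initial states. This is only a sketch: the step budget `max_steps` and the range of tested values are our assumptions, chosen so that any run that would terminate on the grid does so well within the budget.

```python
def terminates(i, n, max_steps=1000):
    """Simulate `while (i != n) do i := i + 2` for at most max_steps iterations."""
    for _ in range(max_steps):
        if i == n:
            return True
        i += 2
    return i == n

def f_omega(i, n):
    """The state formula from Example 15: i > n or (n - i) mod 2 = 1."""
    return i > n or (n - i) % 2 == 1

formula_is_exact_on_grid = all(
    f_omega(i, n) == (not terminates(i, n))
    for i in range(-20, 21)
    for n in range(-20, 21)
)
```

For this particular loop the over-approximation coincides with the exact set of non-terminating states on the tested grid; in general, only containment is guaranteed.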


Additional examples of termination analyses in the algebraic framework 
appear in [63] and [62]. 


5.2 The Instantiation of the Recipe 
The recipe from Sect. 3.3 is instantiated for termination analysis as follows: 


1. (Modeling). The concrete semantics is the set of non-terminating states, which 
is the greatest solution to a system of right-linear equations. 

2. (Closed forms). The language of closed forms is given by ω-regular
expressions; they can be computed by a variation of Tarjan’s algorithm [63].

3. (Interpretation). An w-regular expression can be interpreted as a state formula 
representing a set of possibly non-terminating states, while regular expressions 
are interpreted as transition formulas (Sect. 2). The soundness relation is
over-approximate: we can prove that a program terminates by finding an 
unsatisfiable pre-condition, but the analysis cannot prove non-termination. 


6 Recap 


This section contains a few remarks about commonalities among the three kinds
of problems and the techniques we have presented for applying algebraic program
analysis to them. The paper has been structured around the three-part recipe
for algebraic program analysis given in Sect. 3.3. Table 1 recaps how the recipe
has been instantiated for the three kinds of problems considered.

Table 1. Instantiations of the recipe for algebraic program analysis from Sect. 3.3.

| | Section 3.3 | Section 4.3 | Section 5.2 |
|---|---|---|---|
| Analysis type | Intraprocedural | Linear interprocedural | Termination |
| Modeling | LFP of left-linear equations | LFP of linear equations | GFP of right-linear equations |
| Closed-form solution | A regular expression (path expression over the CFG) | Detensor of a tensored path expression | An ω-regular expression |
| Interpretation (concrete) | A Kleene algebra (Definition 1), e.g., transition relations (Sect. 3.1) | A Kronecker algebra (Definition 5), e.g., tensored transition relations (Example 10) | A Büchi algebra (Definition 7), e.g., non-terminating states (Example 13) |
| Interpretation (abstract) | A regular algebra (Sect. 2), e.g., a transition-formula interpretation (Sect. 2.1) | A tensor-product algebra (Definition 3), e.g., a tensored transition-formula interpretation (Example 11) | An ω-algebra (Definition 6), e.g., a non-terminating state-formula interpretation (Sect. 5.1) |

Table 2. “Loop-solving” steps.

| Equation type | Form of “loop” | Closed form for X |
|---|---|---|
| Left-linear | $X = a + Xb$ | $ab^{*}$ |
| Linear | $X = a + \sum_i b_i X c_i$ | $((1 \otimes a) \odot (\bigoplus_i b_i \otimes c_i)^{\circledast})^{\lightning}$ |
| Right-linear | $X = (a \boxdot X) \boxplus z$ | $a^{\omega} \boxplus (a^{*} \boxdot z)$ |

Within this paper, all methods for computing closed-form solutions can be
understood as some variation of Gaussian elimination, Algorithm 1 (in practice,
they are variations of Tarjan’s path-expression algorithm). The essential
difference between Sect. 2, Sect. 4, and Sect. 5 is the “loop-solving” step. Each
requires the right-hand-side expression R to be in a particular form (left-linear,
linear, right-linear), and each requires a different language of expressions in
which to express closed forms (regular, tensored regular, ω-regular). Table 2
shows the respective “loop-solving” steps for computing a closed form. Note that
in Table 2, the letters $a, b_i, c_i, z$ range over expressions (which may involve
variables other than X). For example, to apply the left-linear rule to the
equation X = Xp + Xq + Yr + Z, we first re-arrange terms on the right-hand side as
X(p + q) + (Yr + Z) and then compute the “closed-form” (Yr + Z)(p + q)*.
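The re-arrangement used in the left-linear example can be written as a small symbolic routine. This is a toy sketch of ours: a right-hand side is represented as a list of (head, suffix) pairs, and the function and its string output format are illustrative assumptions rather than anything prescribed by the paper.

```python
def solve_left_linear(var, terms):
    """Apply the rule  X = a + X·b  ~>  a·b*  to a sum of terms.

    Each term is a (head, suffix) pair: ("X", "p") stands for X·p, and
    ("", "Z") stands for the constant Z.
    """
    # Terms that start with the variable being solved form the loop part b.
    loops = [suffix for (head, suffix) in terms if head == var]
    # All remaining terms form the non-recursive part a.
    rest = [head + suffix for (head, suffix) in terms if head != var]
    a = " + ".join(rest)
    if not loops:
        return a
    b = " + ".join(loops)
    return f"({a})({b})*"

# X = Xp + Xq + Yr + Z
closed = solve_left_linear("X", [("X", "p"), ("X", "q"), ("Y", "r"), ("", "Z")])
```

On the example equation, `closed` is the string `(Yr + Z)(p + q)*`, matching the closed form derived in the text.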


7 Related Work 


Abstracting States Versus State Changes. Classically, invariant generation is con- 
ceived as the problem of over-approximating the reachable states of a program. 
Computing invariants involves solving a system of equations of the form

$$
\begin{aligned}
X[r] &= v_r && r \in Nodes, \text{ the root node} \\
X[n] &= \bigsqcup_{e_{m,n} \in Edges} \mathcal{F}[e_{m,n}](X[m]) && n \in Nodes - \{r\}
\end{aligned}
\qquad (11)
$$

for the unknowns X[n], n ∈ Nodes, where $v_r$ represents the set of initial states
and $\mathcal{F}[-]$ provides an interpretation of each CFG edge as a state
transformer. In a solution, X[n] holds a descriptor that represents a superset of
the set of program states that can arise at program point n. Note that in
Eq. (11), the function $\mathcal{F}[e_{m,n}]$ on edge $e_{m,n}$ is applied to the
value X[m] on node m.

Algebraic program analyses, in contrast, concern dynamics—state changes— 
rather than states. The reason is that algebraic analyses are compositional: states 
do not compose, but state changes do. 

A first step towards abstracting state changes was taken by Graham & Wegman
[33], who gave a method to solve dataflow equations via composition of the
state transformers on CFG edges. That is, their basic primitives were (i)
composition of functions, and (ii) union of functions. If we adopt this outlook
and define $r_1 \cdot r_2$ to be $r_2 \circ r_1$, $r_1 + r_2$ to be the union of
$r_1$ and $r_2$, and 1 to be the identity function, instead of Eq. (11), the goal
would be to solve the following equation system:

$$
\begin{aligned}
X[r] &= 1 && r \in Nodes, \text{ the root node} \\
X[n] &= \sum_{e_{m,n} \in Edges} X[m] \cdot \mathcal{F}[e_{m,n}] && n \in Nodes - \{r\}
\end{aligned}
\qquad (12)
$$

where the unknowns X[n] are now function-valued. Note that the function
$\mathcal{F}[e_{m,n}]$ on edge $e_{m,n}$ is composed with the value X[m] on node m.
From here, because one is working over function-valued quantities, it is now
natural to formulate interprocedural program-analysis problems by means of
equations over unknowns that denote procedure summaries, as was done by Cousot
and Cousot [23] and Sharir and Pnueli [56].


“Interpret, Then Solve” Versus “Solve, Then Interpret.” The systems in 
Eqs. (11) and (12) are interpreted, in the sense that they are understood as 
semantic equations valued over a particular abstract domain, say D. Such a
system $E = \{X_i = R_i\}_{i \in I}$ can be solved by an iterative method: compute
a sequence $\sigma_0, \sigma_1, \ldots \in \{X_i\}_{i \in I} \to D$ of assignments
of abstract domain values to variables

$$
\begin{aligned}
\sigma_0(X_i) &\triangleq 0 && \text{for all } i \in I \\
\sigma_{n+1}(X_i) &\triangleq \mathcal{D}_{\sigma_n}[\![R_i]\!] && \text{for all } n \geq 0 \text{ and all } i \in I
\end{aligned}
$$

Eventually this process converges, typically with the aid of widening to
extrapolate to the limit, upon an assignment that over-approximates the least
solution to E.
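The iterative method can be sketched for a finite abstract domain, where no widening is needed. Everything concrete in the sketch below is a made-up illustration: the three equations, the powerset domain over {0, ..., 4}, and the helper `inc` are our assumptions, not an example from the paper.

```python
def kleene_iteration(equations, bottom, max_rounds=1000):
    """Iterate sigma_{n+1}(X_i) = [[R_i]](sigma_n) from sigma_0 = bottom."""
    sigma = {var: bottom for var in equations}
    for _ in range(max_rounds):
        nxt = {var: rhs(sigma) for var, rhs in equations.items()}
        if nxt == sigma:          # the assignment is stable: a fixpoint
            return sigma
        sigma = nxt
    raise RuntimeError("no fixpoint reached; an infinite domain needs widening")

def inc(states):
    """State transformer for x := x + 1, restricted to values up to 4."""
    return frozenset(s + 1 for s in states if s + 1 <= 4)

# Hypothetical system: root r, a node n1 with a self-loop, and an exit node n2.
equations = {
    "r": lambda s: frozenset({0}),
    "n1": lambda s: inc(s["r"]) | inc(s["n1"]),
    "n2": lambda s: s["n1"],
}
solution = kleene_iteration(equations, frozenset())
```

Each round re-evaluates every right-hand side under the previous assignment, which is exactly the "interpret, then solve" order: the equations are given meaning in the domain first, and the loop is handled by repeated evaluation.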

In algebraic program analysis, we think of a system of equations as an unin- 
terpreted (syntactic) object. Equations are solved symbolically and then the 
solutions are interpreted in an algebraic structure to obtain an analysis result. 
The key step in this direction was made by Tarjan [59], who observed that 
once a solution to the path-expression problem was in hand, multiple dataflow- 
analysis problems could be solved merely by reinterpreting the alphabet symbols 
and operators of regular expressions in different algebras—i.e., “solve and then 
interpret.” 

Whereas the iterative framework for program analysis has a “built-in”
algorithm for analyzing loops and recursive behavior (by computing the limit of a
sequence), the algebraic framework does not prescribe any particular method,
and it is up to the analysis designer to devise one. This obligation places an
additional burden on the analysis designer, but also provides flexibility: the
analysis designer may analyze loops in ways that may (Example 6) or may not
(Examples 4 and 5) resemble iterative fixpoint computation.


Iteration Operators and Loop Summarization. In the computer-aided-verification 
community, there is a body of literature on loop summarization (or “loop leap- 
ing”) and acceleration. Summarization aims to compute or approximate the 
behavior of (certain) loops, while acceleration aims to approximate the postim- 
age of a set of states under a loop. These techniques have been incorporated 
into iterative abstract interpretation [28,31], abstraction-refinement-based
software model checking [19,37], termination analysis [7,20,60], and resource
bound analysis [10,64]. The most closely related techniques to algebraic program
analysis are those that build summaries for whole programs in “bottom-up” fashion.
Such analyses have been formalized in various ways, including: recursion on the 
abstract syntax tree (AST) of a program [51], AST rewriting [8], and graph 
rewriting [47,60]. Algebraic program analysis provides a unifying foundation for 
such analyses, in the same way that dataflow analysis [39] and (iterative) abstract 
interpretation [22] provide a unifying foundation for iterative program analyses. 

There are several methods for loop summarization, based on finite-monoid
affine transformations [11,12,29], difference-bound relations [15,21], octagonal
relations [13,14,45], integer vector addition systems [35], and fragments of the
theory of arrays [2]. For the most part, these summarization methods are non-uniform
in the sense that their input language differs from their output language (e.g., 
[13] takes as input an octagonal relation and produces as output a Presburger 
formula). This non-uniformity is the essential barrier that must be overcome to 
use such techniques to implement the iteration operator of an algebraic program 
analysis (e.g., we can define an iteration operator by using optimization modulo 
theories [55] to extract the octagonal hull of a Presburger formula, then use [13] 
to compute a Presburger formula representing its transitive closure). 


Elimination-Based Dataflow Analysis. Elimination-based dataflow analysis is a 
family of dataflow analyses that computes analysis results using methods that 
resemble Gaussian elimination [3,33,36] (see [54] for a survey). Early methods 
were specialized to reducible control flow graphs, but operated faster than general 
Gaussian elimination. Tarjan’s algorithm [58] is an elimination method with 
fast operation on reducible (and “nearly reducible”) control flow graphs, but is 
applicable to arbitrary graphs. 


Weighted Graphs. There is a vast literature on solving path problems on 
weighted graphs where the weights are drawn from a semiring [1,30,50]. Path 
problems can also be solved on semiring-weighted pushdown systems, which has 
applications to interprocedural dataflow analysis [52]. This work focuses on
iterative techniques for solving path problems.

(Non-iterative) algorithms for path problems over algebraic structures with
an explicit iteration operator were considered by Aho et al. [1], Backhouse &
Carré [5], and Lehmann [48], and were implicit in previous work by Kleene [44]
and McNaughton & Yamada [49]. Tarjan connected this line of work with program
analysis [58,59].


8 Open Problems 
We conclude with a list of challenges suggested by algebraic program analysis. 


Scaling SMT-Based Algebraic Program Analysis. The bottom-up interpretation
step of a closed-form expression is efficient, in that it operates in linear time and
space in the size of the expression DAG in a model where each algebraic operation
has unit cost. For logic-based interpretations, however, algebraic operations do
not have unit cost: operators manipulate formulas, and the size of those formulas
may grow as operators are applied. For example, the regular expression
$a^{2^n}$ can be represented by an expression DAG with n + 1 nodes, with the
following shape:


[Diagram: a chain of n product (·) nodes, each of whose two operands point to the
next node in the chain, ending in the leaf a.]


If the letter a is interpreted as the transition formula x' = x + 1 and · as
relational composition, then the transition-formula interpretation of
$a^{2^n}$ has size $O(2^n)$. Scaling SMT-based algebraic program analysis to
large programs requires techniques for generating succinct summaries, and/or
efficient reasoning about compact formula representations involving
λ-expressions.
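The gap between DAG size and formula size can be made concrete with a simple size count. The cost model in this sketch is our assumption: interpreting a product node copies the formulas of both operands and adds one connective, and the leaf formula has size 1.

```python
def squared_dag_size(n):
    """Nodes in the DAG for a^(2^n): one leaf plus n product nodes."""
    return n + 1

def inlined_formula_size(n, leaf_size=1):
    """Formula size when each product node copies both operand formulas."""
    size = leaf_size
    for _ in range(n):
        size = 2 * size + 1   # two copies of the operand plus one connective
    return size

dag_nodes = squared_dag_size(10)
formula_size = inlined_formula_size(10)
```

Under this cost model the formula size is 2^(n+1) − 1, so for n = 10 a DAG with 11 nodes yields a formula of size 2047.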


Recursive Procedures. Section 4.2 shows how the algebraic approach can be 
applied to summarize linearly recursive procedures. But to compute sum- 
maries for generally recursive procedures, current-generation algebraic-program- 
analysis tools fall back on another non-algebraic scheme (such as hybrid itera- 
tive/algebraic, like Kleene or Newton iteration [40,53], or the template-based 
approach of [16]). This raises the question: is there a practical algebraic method 
for analyzing general recursion? The essential challenge is in devising a language 
of “closed forms” that (1) can represent arbitrary context-free languages, and 
(2) is amenable to an effective interpretation in logic. 


Beyond Numerical Domains. To date, all algebraic program analyses have been 
numerical in nature—they abstract away aspects of program behavior that can- 
not be captured by integer variables. It remains to be seen whether the algebraic 
approach can yield practical analyses for reasoning about features like strings, 
arrays, and the heap. Reasoning about memory manipulation is particularly 
challenging in a compositional setting, since we cannot rely on the context of 
a program fragment to resolve aliasing relationships. One possible avenue is to 
incorporate abductive reasoning to make educated guesses about the shape of 
memory, as in [18]. 
Property Refutation. Algebraic program analysis is typically conceived as
a method for generating over-approximate summaries. The nature of over- 
approximation is that the summaries can be used to verify that a program 
does satisfy a property of interest, but not prove that it doesn’t. An interest- 
ing direction for future work is to devise methods by which algebraic program 
analyses can refute properties, perhaps based on bounded model checking [9], 
under-approximate loop summarization [46], or symbolic execution [43]. 
Abstract. Program synthesis is now a reality, and we are approaching
the point where domain-specific synthesizers can handle problems
of practical sizes. Moreover, some of these tools are finding adoption 
in industry. However, for synthesis to become a mainstream technique 
adopted at large by programmers as well as by end-users, we need to 
design programmable synthesis frameworks that (i) are not tailored to 
specific domains or languages, (ii) enable one to specify synthesis prob- 
lems with a variety of qualitative and quantitative objectives in mind, 
and (iii) come equipped with theoretical as well as practical guarantees. 
We report on our work on designing such frameworks and on building 
synthesis engines that can handle program-synthesis problems describ- 
able in such frameworks, and describe open challenges and opportunities. 


1 Introduction 


1.1 A Synthesis Tale 


Monica, a software engineer, is trying to write a program for transforming data 
she has stored in an array of integer numbers. Monica needs to zero-out all the 
negative entries from the array (they represent irrelevant data points) and add 
10 to all the positive entries (this is a normalization step needed in Monica’s 
API). Of course, Monica is a great engineer and she could write this program 
herself, but since Monica knows that similar problems arise often in her company 
(i.e., reformatting arrays to match certain APIs), Monica decides to try out this 
new thing everyone is talking about: program synthesis. 

Monica wants a tool that takes as input some examples of the desired 
transformation and a set of operators the program can use, and magically 
outputs the intended program. In fact, Monica already has an input, a 
unit test, that she wants to process using her newly synthesized program: 
[-1, 2, 3, 10, 31, -14, -11], for which the output should be [0, 12, 13, 20, 41, 0, 0].

Monica also knows that the final program will look like a loop that iterates 
over the input array arr, which leads her to develop the grammar in Fig. 1. 
Monica thinks this grammar is general enough that it will cover a reasonable 
range of programs for similar tasks but limited enough that it will not result in 
spurious programs that overfit too much to the examples. 

Quickly, Monica discovers that using program synthesis is not so straight- 
forward. There are so many different tools! And they all take different kinds of 


© The Author(s) 2021 
Start → x = len(arr) - 1; while x>=0 do S
S → arr[E] = arr[E] + E | arr[E] = E |
    x = E | S; S | if arr[x]>0 then S else S
E → 0 | 1 | x | E + E | E - E


Fig. 1. The grammar $G_{ex}$ Monica has in mind for synthesizing programs that iterate
over an input array (Start is the starting nonterminal). $G_{ex}$ is general enough to cover
most programs that iteratively normalize entries in an array.


inputs. After a bit more research, Monica decides to go for one of the many 
tools, UltraSynth™, and encodes her problem. UltraSynth is written in a C-like 
language and Monica has mostly programmed in Python for her job. However, 
Monica decides to give UltraSynth a try and after a few days of learning the 
ins and outs of UltraSynth, she finally manages to encode her transformation- 
synthesis problem in UltraSynth. To achieve her goal, Monica had to tweak a bit 
what the grammar looks like to provide it to UltraSynth, which only accepted 
grammars without unbounded recursion (i.e., without infinitely many terms) 
and had to encode the examples in a way that was accepted by the tool. 

The time has come and Monica manages to run UltraSynth on an instance 
of the synthesis problem. UltraSynth outputs the program in Fig. 2b, which is 
correct on the example. However, this program is needlessly large and contains 
many unneeded operations. 


(a) A possible ideal solution for Monica’s problem:

x = len(arr)-1
while x>=0 do
  if arr[x]>0 then
    arr[x] = arr[x]+10
  else
    arr[x] = 0
  x = x-1

(b) A solution for Monica’s problem synthesized by UltraSynth. The sequence of ±1s on line 3 adds up to +10:

x = len(arr)-1
while x>=0 do
  if arr[x]>0 then
    arr[x] = arr[x]+1-1...+1
  else
    arr[x] = 0
  x = x-1


Fig. 2. Two possible solutions for Monica’s synthesis problem.


Monica has already invested a lot of time in learning UltraSynth, so she 
tries to figure out a way to avoid such problematic programs. Monica astutely 
realizes that the needless computations in Line 3 of Fig. 2b are due to repeated 
applications of the minus operator. Monica would like to ask UltraSynth to 
synthesize the program that contains as few minus operators as possible, but 
UltraSynth does not support a way to “prefer” one possible program over another. 
To bypass this limitation, Monica decides to remove the production E → E - E
in order to suppress these programs.


86 L. D’ Antoni et al. 


Monica reruns UltraSynth after removing E → E - E from the grammar,
and to her surprise, UltraSynth continues to run for hours and eventually times 
out without providing a solution. After investigating the matter, Monica finds 
out that she has made a mistake and disallowed too many programs—there is no 
longer a valid solution to the synthesis problem because without subtraction, the 
variable x cannot be decremented in line 5. UltraSynth was unable to report, or 
even detect this simple mistake—Why is it so difficult to program a synthesizer 
and why can’t synthesis tools detect the simplest of mistakes? 

Monica has finally had enough of synthesis. She goes back to her daily rou- 
tine and just writes the 7-line piece of code that applies the transformation she 
intended (Fig. 2a). 


1.2 Programmable Synthesis Frameworks 


The story of Monica is a common one in program synthesis, where most of the 
recent focus has been on solving problems rather than building general algo- 
rithms, tools, and methodologies. Existing synthesis frameworks are not pro- 
grammable as they lack at least one of the following properties: 


Domain-Agnostic. Existing synthesis ideas and algorithms have been intro- 
duced with specific domains in mind and are hard to apply to arbitrary 
synthesis problems. The languages used to specify synthesis problems are 
therefore domain-specific, and often fail to abstract the logical requirements 
of the synthesis problem. In our example, Monica had to look for a specific 
tool that accepted programs of the kind she was interested in. Moreover, she 
was not permitted to refine the specification to add a quantitative objective 
she had in mind (minimizing the number of minus operators). 

Solver-Agnostic. Different synthesis tools are typically not interchangeable
because their underlying solvers solve different types of problems. Even when
two solvers can in principle solve the same types of problems, they usually
cannot be interchanged or combined because they use drastically
different formats written in different languages (e.g., Racket [28] vs. C [27]).
For example, when Monica found out that UltraSynth was not working as 
expected, she could not easily try another tool to see if that tool was better. 


This state of affairs is unfortunate because synthesis is very general; if synthe- 
sis were easier to use, it would benefit many domains. The potential generality, 
which is currently held back by the need for better support for usability, under- 
scores the need to answer the following question: 


Can we make synthesis more programmable? 


In this paper, we present the steps we have undertaken in the direction of making 
synthesis more programmable, including some of the challenges that we faced, 
and some of the opportunities that the work has opened. 
2 An Overview of Programmable Program Synthesis


The goal of enlarging the scope of synthesis has focused our attention on the 
need to have a framework in which synthesis problems can be addressed. By a 
framework, we mean the conceptual underpinnings that allow one to build tools 
to automate the creation of solutions for problems in some domain, in this case, 
program-synthesis problems. The canonical example is how the theory of parsing 
underlies the yacc tool [13], which automates the construction of parsers. For 
instance, consider the problem that yacc addresses: 


— An instance of a parsing problem, Parse(L,s), has two parameters: L, a 
context-free language; and s, a string to be parsed. String s changes more 
frequently than language L. 

— Context-free grammars are a formalism for specifying context-free lan- 
guages. 

— Create a tool that implements the following specification: 

  • Input: a context-free grammar that describes language L.
  • Output: a parser, yyparse(), such that invoking yyparse() on s
    computes Parse(L,s).


One consideration for building a framework is the existence of a well-defined 
“engine” (or collection of engines) for performing the desired task—in this case, 
parsing s with respect to L, once both L and s are at hand. Yacc supports 
just a single engine, which parses a string with respect to a grammar that is 
LALR(1). In principle, yacc could have been a more general tool by having it 
perform various tests on L to determine what grammar family L belongs to (e.g., 
LALR(1), LR(1), LL(1), LL(*)), and then emitting a parser that makes use of 
an appropriate parsing algorithm for that family, falling back on Generalized LR 
parsing [19] in case L is not in one of the specialized families supported. 

Another aspect illustrated by yacc is that the parameters to the problem 
have different “binding times”. In this case, string s changes more frequently 
than language L—i.e., L is bound early, and s is bound late. The framework 
implementation can exploit the known value of the early-bound parameter to 
create a more efficient implementation. In the case of yacc, it compiles L to 
tables used by a table-driven LALR(1) parsing algorithm. 
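
The binding-time idea can be sketched in a few lines. The following is an illustration only, not how yacc works: it uses CYK recognition over a grammar in Chomsky normal form instead of LALR(1) table construction, and it returns a recognizer (yes/no answer) rather than a full parser. The names `make_recognizer`, `RULES`, and `is_balanced` are hypothetical. What the sketch shares with yacc is the staging: the grammar is indexed once, when L is bound, and the returned function is then applied to many strings s.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

# A rule is (lhs, rhs); rhs is either a 1-tuple holding one terminal character
# or a 2-tuple holding two nonterminals (Chomsky normal form).
Rule = Tuple[str, Tuple[str, ...]]


def make_recognizer(rules: List[Rule], start: str) -> Callable[[str], bool]:
    # Early-bound work: index the grammar once, before any string is seen.
    by_terminal = defaultdict(set)
    by_pair = defaultdict(set)
    for lhs, rhs in rules:
        if len(rhs) == 1:
            by_terminal[rhs[0]].add(lhs)
        else:
            by_pair[(rhs[0], rhs[1])].add(lhs)

    def recognize(s: str) -> bool:
        # Late-bound work: CYK over the precomputed indexes.
        n = len(s)
        if n == 0:
            return False  # this sketch does not handle the empty string
        # table[i][l] holds the nonterminals that derive s[i:i+l]
        table = [[set() for _ in range(n + 1)] for _ in range(n)]
        for i, ch in enumerate(s):
            table[i][1] = set(by_terminal.get(ch, ()))
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                for split in range(1, length):
                    for b in table[i][split]:
                        for c in table[i + split][length - split]:
                            table[i][length] |= by_pair.get((b, c), set())
        return start in table[0][n]

    return recognize


# Nonempty balanced parentheses: S -> L R | L X | S S, X -> S R, L -> "(", R -> ")"
RULES = [
    ("S", ("L", "R")), ("S", ("L", "X")), ("S", ("S", "S")),
    ("X", ("S", "R")), ("L", ("(",)), ("R", (")",)),
]
is_balanced = make_recognizer(RULES, "S")   # L is bound here
print(is_balanced("(())()"), is_balanced("(()"))  # s is bound here
```

The design point is that `make_recognizer` is called once per language, while `is_balanced` is called once per string, mirroring the early and late binding of L and s.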


2.1 Why Isn’t Existing Work in Synthesis Programmable? 


There do exist synthesis tools (mostly, solver-aided languages [27,28]) that allow 
one to control some aspects of a synthesis problem in a programmable fashion. 
However, the nature of existing synthesis tools also forces an association between 
how a synthesis problem is written and how it is solved. For instance, in Sect. 1, 
the fact that solvers are tightly coupled to some specification language prevented 
Monica from trying out a different tool after UltraSynth produced an inadequate 
answer. The current state of program-synthesis tools is depicted in Fig. 3. 
[Figure 3: diagram with three columns. Synthesis Problems: synthesize expressions, synthesize imperative programs, synthesize string transformations. Domain-Specific and Solver-Agnostic Specification: Sketch, Rosette. Non-Interchangeable Solvers: enumeration-based solver, constraint-based solver (CVC4Sy), domain-specific solver.]


Fig. 3. Program synthesis today, where the lack of separation between specification 
and solver causes a user to have to encode a problem multiple times to use different 
tools. 


This situation is in direct conflict with the principles articulated at the end 
of Sect. 1, namely, that a user should be able to program the various aspects 
of a synthesis problem using a formalism that is both (i) domain-agnostic and 
(ii) solver-agnostic. The first property addresses generality: the formalism should 
be powerful enough to capture a wide variety of synthesis problems (e.g., SQL, 
regular expressions, and imperative programs). The second property opens the 
door for synthesis-problem specifications to be fed—possibly after a compila- 
tion/translation step—to different specialized solvers, or to multiple solvers with 
different capabilities. 

Another example that one may consider a programmable framework is 
Syntax-Guided Synthesis (SyGuS) [1], which is a successful synthesis frame- 
work targeted at expressions. The defining characteristic of SYGUS, compared 
to other synthesis approaches such as solver-aided languages, is that it allows 
one to write synthesis problems in a completely logical format. 


Example 1. Consider the simple problem of synthesizing the maximum max
of two input variables, x and y. There are two parts to a SyGuS problem: a
syntactic part, written as a context-free grammar such as the example $G_S$ below:


$G_S$ := Start → x | y | Start + Start | if x<y then Start else Start

and the specification part, which is written as a Boolean formula $\psi_S$:

$\psi_S = \forall x, y.\ max(x, y) \geq x \land max(x, y) \geq y \land (max(x, y) = x \lor max(x, y) = y)$


A SyGuS problem is simply the pair $sy = (G_S, \psi_S)$, where a solution to the
SyGuS problem sy is a term $t \in L(G_S)$ such that $\psi_S$ holds. For example, the
following term is a solution for the function max in the problem we just described:


if x<y then y else x.
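
As a sanity check of Example 1, the specification can be evaluated on candidate terms. The sketch below is an assumption-laden illustration: it tests the specification on a bounded grid of integer inputs, which is weaker than the universally quantified formula that a SyGuS solver would discharge with an SMT solver. The function names `spec`, `solution`, `non_solution`, and `holds_on_grid` are hypothetical.

```python
from itertools import product


def spec(x: int, y: int, out: int) -> bool:
    """The body of the specification at one input pair:
    out >= x, out >= y, and out is one of x or y."""
    return out >= x and out >= y and (out == x or out == y)


def solution(x: int, y: int) -> int:
    """The term `if x<y then y else x`."""
    return y if x < y else x


def non_solution(x: int, y: int) -> int:
    """The term `x + y`, which is also in the language of the grammar."""
    return x + y


def holds_on_grid(candidate, bound: int = 5) -> bool:
    """Bounded check: test the specification for all x, y in [-bound, bound]."""
    grid = product(range(-bound, bound + 1), repeat=2)
    return all(spec(x, y, candidate(x, y)) for x, y in grid)


print(holds_on_grid(solution))      # True
print(holds_on_grid(non_solution))  # False, e.g. x = y = 1 gives 2
```

The second candidate shows why membership in the grammar alone is not enough: `x + y` is syntactically allowed but violates the specification.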
The advantage of such a logic-based formalism is that it achieves a separation
between solver and specification, which allows SyGuS to be solver-agnostic.
Several different SyGuS solvers have been developed (e.g., [4,7,21,26]), many of
which use drastically different internal algorithms that have different strengths
for solving different kinds of problems. Moreover, a user of SyGuS need not
consider the differing input languages or characteristics of these solvers, and instead
can encode their problem just once in the SyGuS format to have access to all
the different solvers.

While SyGuS achieves solver-agnosticity and shows its benefits, it fails
to achieve domain-agnosticity because the framework is targeted specifically at
expressions. For example, consider the scenario from Sect. 1: Monica would be
unable to encode her problem in SyGuS, because the grammar $G_{ex}$ in Fig. 1
contains a production with a while loop, and loops, which require a custom
semantics, cannot be expressed in any decidable theory, which is a key restriction of
SyGuS. SyGuS also does not allow one to express intent outside of the
behavioral specification $\psi$, which would have prevented Monica from trying to optimize
the program obtained from UltraSynth in Fig. 2b.

All in all, the current state of program synthesis is an unsatisfactory mess, 
as depicted in Fig. 3. There are multiple non-interoperable solvers with different 
input languages, targeting different synthesis domains with varying degrees of 
overlap. SYGUS, by virtue of solver-agnosticity, provides a unified approach to 
synthesizing expressions, which forms the basis of multiple solvers. However, 
while SYGUS is a bright spot, it fails to be general: it does not cope with (1) the 
variety of domains used in synthesis, required to deal with arbitrary languages 
(e.g., SQL, regular expressions, and imperative programs), and (2) the variety of 
collateral considerations that arise for different domains (e.g., types, quantitative 
objectives, and probabilities). 


2.2 What Does a Programmable Synthesis Framework Look Like? 
Our vision of programmable synthesis can be summed up as follows: 


programmable synthesis = easily instantiable, domain-agnostic, solver-agnostic synthesis framework.


In contrast with Fig. 3, what we would like to have is depicted in Fig. 4, where 
both user and solver work with a unified general format, regardless of domain 
or solving technique. Such an approach would allow one to specify a synthesis 
problem once and for all, without having to worry about the underlying solving 
strategy. To achieve this goal, it is necessary to distill out the essence of
many program-synthesis problems into a specification formalism that is grounded
in formal methods (e.g., automata and logic) and is agnostic to any specific
domain of application. This degree of abstraction also opens the opportunity 
to lift certain synthesis algorithms and ideas to a higher level that makes these 
algorithms reusable across different tools. Our framework can then interface to 
[Figure 4: diagram with three columns. Synthesis Problems: synthesize expression, synthesize string transformation in DSL, synthesize loops, quantitative objectives, probabilistic constraints. Domain- and Solver-Agnostic specification: programmable synthesis framework (Semantics-Guided Synthesis). Interchangeable Solvers: enumeration-based solver, constraint-based solver, MCMC solver, domain-optimized solvers (e.g., for regular expressions), unrealizability solver.]

Fig. 4. Programmable program synthesis, where a synthesis problem with arbitrary
constraints can be written once and for all in a general format, which can then be
dispatched to compatible solvers.


different solving tools (backend solvers) in a way that allows one to easily swap 
one solver for another, or to use multiple solvers in tandem. If our vision is 
achieved, the capabilities that would be available to tool designers—discussed 
in greater detail in Sect. 5—would allow synthesis tools to be created that have 
the kind of flexibility that Monica expected and needed in Sect. 1.1. 

Let us now be more concrete about the requirements for such a framework for 
synthesis. Following the pattern for yacc given above, a framework for synthesis 
could follow a similar scheme: 


— An instance of a synthesis problem Synthesize($\mathcal{L}$, $[\![\cdot]\!]_{\mathcal{L}}$, $\varphi$) has three
parameters: $\mathcal{L}$, a formal language; $[\![\cdot]\!]_{\mathcal{L}}$, a semantics to ascribe to $\mathcal{L}$; and $\varphi$, a
behavioral specification for some desired member of $\mathcal{L}$. The behavioral
specification $\varphi$ changes more frequently than $\mathcal{L}$ and $[\![\cdot]\!]_{\mathcal{L}}$.

— Let $F_{syntax}$ and $F_{semantics}$ be appropriate formalisms for specifying $\mathcal{L}$ and
$[\![\cdot]\!]_{\mathcal{L}}$, respectively.

— Create a tool that implements the following specification:

  • Input: an $F_{syntax}$ specification of a language’s syntax, and an $F_{semantics}$
    specification of the language’s semantics.

  • Output: a function $Synth_{\mathcal{L},[\![\cdot]\!]_{\mathcal{L}}}(\cdot)$ that takes $\varphi$ as input and computes
    Synthesize($\mathcal{L}$, $[\![\cdot]\!]_{\mathcal{L}}$, $\varphi$).
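
The scheme above can be sketched as a staged function. Everything in the following block is a hypothetical illustration under simplifying assumptions: the language is given as a dictionary of productions, the semantics as one Python callable per production (rather than a logical specification), the behavioral specification is checked on a finite list of inputs, and the search is a naive enumeration by term height. The names `make_synthesizer`, `GRAMMAR`, `SEMANTICS`, and `spec` are introduced here and do not come from any existing tool.

```python
from itertools import product


def make_synthesizer(grammar, semantics, start, max_depth=3):
    """Bind the language and its semantics early; return a function of the spec."""

    def terms(nt, depth):
        """Yield every term of nonterminal nt with height at most depth."""
        for op, child_nts in grammar[nt]:
            if not child_nts:
                yield (op,)
            elif depth > 1:
                choices = [list(terms(c, depth - 1)) for c in child_nts]
                for children in product(*choices):
                    yield (op, *children)

    def evaluate(term, env):
        """Compositional semantics: combine the values of the children."""
        op, *children = term
        return semantics[op](env, *(evaluate(c, env) for c in children))

    def synth(spec, inputs):
        """Return the first enumerated term that meets spec on all inputs."""
        for depth in range(1, max_depth + 1):
            for term in terms(start, depth):
                if all(spec(env, evaluate(term, env)) for env in inputs):
                    return term
        return None

    return synth


# Instantiation for the grammar of Example 1 (production names are made up).
GRAMMAR = {"Start": [("x", []), ("y", []),
                     ("plus", ["Start", "Start"]),
                     ("ite_x_lt_y", ["Start", "Start"])]}
SEMANTICS = {
    "x": lambda env: env["x"],
    "y": lambda env: env["y"],
    "plus": lambda env, a, b: a + b,
    "ite_x_lt_y": lambda env, a, b: a if env["x"] < env["y"] else b,
}


def spec(env, out):
    return out >= env["x"] and out >= env["y"] and out in (env["x"], env["y"])


synth_max = make_synthesizer(GRAMMAR, SEMANTICS, "Start")
inputs = [{"x": x, "y": y} for x in range(-2, 3) for y in range(-2, 3)]
result = synth_max(spec, inputs)
print(result)  # ('ite_x_lt_y', ('y',), ('x',)), i.e. if x<y then y else x
```

The same `synth_max` could be reused with a different specification without touching the grammar or the semantics, which is the late binding of the behavioral specification described above.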


To be even more concrete, $F_{syntax}$ could be a regular-tree grammar [5],¹
and $F_{semantics}$ would be defined over the grammar in a compositional manner,
production by production. What we have called collateral considerations
(types, quantitative objectives, probabilities, etc.) would be handled as part
of the $F_{syntax}$ or $F_{semantics}$ specifications, depending on the issue at hand. For
instance, constraints on program behavior, such as refinement types [24],
minimizing/bounding evaluation resource usage [11,15], and probabilistic behavior


1 The grammar would also be equipped with production-by-production pretty-printing 
rules to specify how to convert a tree to its textual representation. 
[22], are semantic concerns that would be part of $F_{semantics}$. Other considerations
would be part of $F_{syntax}$, such as bounds on the use of syntactic constructs [12],
or the use of probabilistic generative models of syntactic structures [3,17]. For 
instance, for these two issues, one could weight the productions of the grammar 
with values from a semiring, and place a (possibly learned) distribution on the 
productions, respectively. 
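
A minimal sketch of the first option, weighting productions with semiring values, is given below. It assumes the tropical (min, +) semiring and a hypothetical weight table `WEIGHTS` in which only the subtraction production costs 1, which corresponds to Monica's wish to minimize the number of minus operators. The names `cost` and `best` are also introduced here for illustration.

```python
# Hypothetical production weights: only the minus production is penalized.
WEIGHTS = {"0": 0, "1": 0, "x": 0, "plus": 0, "minus": 1}


def cost(term) -> int:
    """Semiring product along a term: weights are combined with +."""
    op, *children = term
    return WEIGHTS[op] + sum(cost(child) for child in children)


def best(candidates):
    """Semiring sum across candidate terms: alternatives are combined with min."""
    return min(candidates, key=cost)


one = ("1",)
direct = ("plus", one, one)                           # 1 + 1
roundabout = ("minus", ("plus", direct, one), one)    # 1 + 1 + 1 - 1
print(cost(direct), cost(roundabout))                 # 0 1
print(best([roundabout, direct]) is direct)           # True
```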

The scheme in the box above would allow us to meet the goals of being
both domain-agnostic and solver-agnostic,² as long as (i) the formalisms for
$F_{syntax}$ and $F_{semantics}$ are sufficiently powerful to qualify as “domain-agnostic,”
and (ii) specifications in these formalisms can be analyzed and broken down 
into components that can be farmed out to existing solvers (or perhaps to new 
implementations of the kinds of algorithms used in existing solvers). 


Who benefits from such a framework? The existence of a domain- and solver- 
agnostic framework benefits two parties: (i) users of synthesis tools such as 
Monica, and (ii) designers of synthesis tools, such as the team behind
UltraSynth. Both scenarios can be illustrated by making an analogy with LLVM [20],
which provides an intermediate representation for compilation that is similarly
both domain- and solver-agnostic. Users of LLVM, who are front-end language
designers, benefit from two facts: (i) that the LLVM IR is rich enough to support
the range of features their language might have, and (ii) that once their language 
is compiled down into LLVM IR, the entire library of LLVM IR optimizations is 
accessible to them. Similarly, a programmable synthesis framework benefits users 
in two ways: (i) by supporting the full range of features that may be required 
for a synthesis problem, and (ii) by putting multiple solvers within reach for 
problems written in the framework. Additionally, a well-defined framework also
facilitates reuse of problem components: for example, Monica can reuse $G_{ex}$ for
synthesizing other array transformations.

On the other hand, backend optimization designers of LLVM benefit from 
the fact that once their optimization is written in LLVM, all LLVM users may 
easily access those optimizations if need be. Similarly, tool designers for a pro- 
grammable synthesis framework rest easy knowing that once their tool sup- 
ports the framework, those who need it will find it accessible and easy to use— 
regardless of what internal techniques they decide to use. Note that while the 
framework intends to be general, tools that interface with the framework can 
choose to be selective in the problems they support—it is up to the users, or 
perhaps the framework designers, to match a problem with an appropriate solver 
(similar to how language designers mix and match backend optimizations for 
their language in LLVM). In addition, advances at the framework-level—such as 


2 We also acknowledge that even the scheme given above, which was modeled on the
one for yacc, is open to revision. In particular, the additional degree of
parameterization for synthesis ($\mathcal{L}$, $[\![\cdot]\!]_{\mathcal{L}}$, and $\varphi$) opens the door for a variety of alternatives,
based on different “binding times” for $\mathcal{L}$, $[\![\cdot]\!]_{\mathcal{L}}$, and $\varphi$. For instance, a solver that uses
different abstract domains as part of a refinement-based search strategy [29] would
have $\mathcal{L}$ and $\varphi$ fixed, but vary $[\![\cdot]\!]_{\mathcal{L}}$. Similarly, when one has quantitative syntactic
objectives [12], the solver would carry out its search with $\varphi$ fixed, $\mathcal{L}$ varying, and
$[\![\cdot]\!]_{\mathcal{L}}$ induced as $\mathcal{L}$ changes.
the development of meta-algorithms, as illustrated in Sect. 4.2—instantly benefit
all tools that support the framework. 


This Paper. New technical challenges, as well as new opportunities, come along 
with our broader goals. In this paper, we present some of the work that we have 
done toward building the kind of framework sketched out above. 


Specifying Programmable Synthesis Problems (Sect. 3). Semantics- 
guided synthesis (SEMGUS) is our proposed framework that allows a user 
to provide both the syntax and semantics for the constructs in the language 
over which programs are to be synthesized. We show how SEMGUS can easily 
be extended with quantitative objectives for specifying when a synthesized 
program is “good” according to a certain metric—e.g., the program should be 
of minimal size or should maximize a certain outcome. 

Solving Programmable Synthesis Problems (Sect. 4). We present solvers 
that can tackle problems specified in the SEMGUS framework. We also present 
a meta-solver that can be combined with other SEMGUS solvers to support
quantitative objectives. Because our framework does not impose
solver-specific restrictions on how synthesis problems are programmed, our solvers
can prove unrealizability (i.e., that a synthesis problem has no solution)
of many complex synthesis problems with infinite search spaces.


These steps are just the beginning of what we expect to be a multi-year jour- 
ney into designing a framework that achieves our goals, and solvers for such a 
framework. We discuss some of the open challenges and opportunities in Sect. 5. 


3 Programmable-Synthesis Specifications 


Designing synthesis frameworks that are programmable requires one to formally 
abstract the essence of how one specifies different program-synthesis problems. 
While we do not claim to have developed a completely unified framework that 
can capture all synthesis problems yet, in this section we present two ideas for 
programming many practical synthesis problems: (i) SEMGUS, a framework that 
uses logic and formal methods to make the search space and specifications of all 
synthesis problems easy to program in arbitrary domains (Sect. 3.1), and (ii) an 
extension of SEMGUS that allows one to specify quantitative objectives over the 
syntactic structure of a synthesized program (Sect. 3.2). 


3.1 Semantics-Guided Synthesis 


Existing work on program synthesis [1] typically identifies two main components 
to a synthesis problem: (i) a search space of candidate programs, which is in 
essence a small programming language, and (ii) a behavioral specification, which 
describes what the synthesized program should do. A programmable synthesis 
framework must represent (at the very least) these two components in a domain- 
and solver-agnostic way. Take the syntax-guided synthesis (SyGuS) framework, 
for example: SYGUS achieves solver-agnosticity by representing the search space
as a regular tree grammar, and the specification as a Boolean formula in a 
decidable background theory. 

Then why is SyGuS, and this particular combination of representations,
unable to achieve domain-agnosticity? The syntactic component of SyGuS,
the grammar, actually does achieve some degree of domain-agnosticity, in the
sense that one is free to define a language of one’s own. However, SyGuS requires
that the terms of the specified grammar belong to a fixed background theory, that is,
that they have a pre-defined, standardized semantics. While this design
choice makes the solutions to SyGuS problems easy to verify (using an SMT
solver), it limits the programmability of the search space.

For example, let us reconsider the example in Sect. 1. If Monica attempted 
to write her example as a SYGUS problem, she would have been unable to use 
loops because loops are not part of the supported background theory. What if 
Monica wanted a solution that operates over a DSL, or had some pre-defined 
components that she wanted to use (like len(arr))? What if Monica wanted 
to synthesize regular expressions, or some other programs with relatively non- 
standard semantics? 

One can intuitively understand these scenarios as synthesis problems over dif- 
ferent programming languages (search spaces)—a DSL, library functions, regular 
expressions. To support different programming languages, a synthesis framework
needs more than the ability to accept a syntax; it needs the ability to accept
a semantics for a language as well. Therefore, developing a programmable
synthesis framework capable of supporting all these scenarios requires designing
a solver-agnostic way of specifying the semantics of such arbitrary program- 
ming languages. SyGuS has shown that regular tree grammars are an effective 
formalism for programming the syntax of a search space; we extend this with a 
formalism to program the semantics of the search space as well, which, to achieve 
true domain-agnosticity, need not be constrained to a fixed background theory. 


Semantics as Constrained Horn Clauses. Our solution to this challenge is the
Semantics-Guided Synthesis (SEMGUS) framework [14], which allows users to
customize the syntax and semantics of the search space. To see how SEMGUS
supports programmable semantics, let us consider the production Start →
while x>=0 do S from Fig. 1 as an example. This production is a while loop, and
part of the semantics for a term produced by this production can be expressed
using the inference rule below³ (where $\Gamma$ represents a state that maps variables
to integer values):


$$\frac{[\![\mathtt{x>=0}]\!](\Gamma) = True \qquad [\![s]\!](\Gamma) = \Gamma_1 \qquad [\![\mathtt{while}\ \mathtt{x>=0}\ \mathtt{do}\ s]\!](\Gamma_1) = \Gamma_2}{[\![\mathtt{while}\ \mathtt{x>=0}\ \mathtt{do}\ s]\!](\Gamma) = \Gamma_2} \quad (1)$$


Such semantics are supported in the SEMGUS framework by expressing the 
inference rule in Eq. (1) as a Constrained Horn Clause (CHC). CHCs are logical 
formulas, and more precisely, they are implications where one is only allowed 


3 A similar rule must be added for the case in which the guard evaluates to false. 
to have a single relation in the conclusion, and a conjunction of relations along
with one constraint in the premise: 


Definition 1 (Constrained Horn Clauses). A Constrained Horn Clause is
a first-order formula of the form


$$\forall \vec{x}, \vec{x}_1, \ldots, \vec{x}_n.\ \big(\phi \land R_1(\vec{x}_1) \land \cdots \land R_n(\vec{x}_n) \Rightarrow H(\vec{x})\big),$$


where $\phi$ is a constraint over some background theory that may contain variables
from $\vec{x}, \vec{x}_1, \ldots, \vec{x}_n$, and $R_1, \ldots, R_n$ and H are uninterpreted relations.


In SEMGUS, search spaces are represented as regular tree grammars, where
productions have associated semantics. In Eq. (1), the semantics of a term x>=0
is represented using the semantic function $[\![\cdot]\!]$. SEMGUS assumes that each
nonterminal N appearing in the grammar has a corresponding logical relation $sem_N$,
which we refer to as the semantic relation, that represents the behavior of the
semantic function $[\![\cdot]\!]$ in Eq. (1). For example, the expression $[\![s]\!](\Gamma) = \Gamma_1$ from
Eq. (1) can be translated into the relation $sem_S(\langle s, \Gamma\rangle, \Gamma_1)$.


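Example 2 (Semantic Rules as CHCs). The following CHC captures how
one would express in SEMGUS the semantics of the production Start →
while x>=0 do S shown in Eq. (1):


$$\frac{\Gamma[x] \geq 0 \qquad sem_S(\langle s, \Gamma\rangle, \Gamma_1) \qquad sem_{Start}(\langle \mathtt{while}\ \mathtt{x>=0}\ \mathtt{do}\ s, \Gamma_1\rangle, \Gamma_2)}{sem_{Start}(\langle \mathtt{while}\ \mathtt{x>=0}\ \mathtt{do}\ s, \Gamma\rangle, \Gamma_2)} \quad (2)$$


One can read Eq. (2) as the following implication:


$$sem_S(\langle s, \Gamma\rangle, \Gamma_1) \land sem_{Start}(\langle \mathtt{while}\ \mathtt{x>=0}\ \mathtt{do}\ s, \Gamma_1\rangle, \Gamma_2) \land \Gamma[x] \geq 0 \Rightarrow sem_{Start}(\langle \mathtt{while}\ \mathtt{x>=0}\ \mathtt{do}\ s, \Gamma\rangle, \Gamma_2) \quad (3)$$


Equation (3) is a CHC where $sem_{Start}$ and $sem_S$ are relations, and $\Gamma[x] \geq 0$
corresponds to the first-order constraint $\phi$.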
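
To see the rules in action, the semantic relations can be read as a small interpreter. This is a sketch under several assumptions: the relations are treated as functions from an input state to an output state (which is adequate here because the language is deterministic), states are Python dictionaries holding `x` and `arr`, terms are nested tuples with made-up tags such as `"store_add"` and `"if_pos"`, and the helper `num` builds a constant from the grammar's 0 and 1. None of these names come from the SemGuS tooling.

```python
def sem_E(e, state):
    """Semantics of E -> 0 | 1 | x | E + E | E - E."""
    op = e[0]
    if op == "0":
        return 0
    if op == "1":
        return 1
    if op == "x":
        return state["x"]
    if op == "+":
        return sem_E(e[1], state) + sem_E(e[2], state)
    if op == "-":
        return sem_E(e[1], state) - sem_E(e[2], state)
    raise ValueError(f"unknown expression {op}")


def sem_S(s, state):
    """Semantics of statements; returns the output state."""
    op = s[0]
    if op == "assign_x":                       # x = E
        return {**state, "x": sem_E(s[1], state)}
    if op == "store":                          # arr[E] = E
        arr = list(state["arr"])
        arr[sem_E(s[1], state)] = sem_E(s[2], state)
        return {**state, "arr": arr}
    if op == "store_add":                      # arr[E] = arr[E] + E
        arr = list(state["arr"])
        arr[sem_E(s[1], state)] = arr[sem_E(s[2], state)] + sem_E(s[3], state)
        return {**state, "arr": arr}
    if op == "seq":                            # S; S
        return sem_S(s[2], sem_S(s[1], state))
    if op == "if_pos":                         # if arr[x]>0 then S else S
        branch = s[1] if state["arr"][state["x"]] > 0 else s[2]
        return sem_S(branch, state)
    raise ValueError(f"unknown statement {op}")


def sem_while(body, state):
    """The two rules for `while x>=0 do s`."""
    if state["x"] >= 0:                        # Eq. (2): guard holds
        state1 = sem_S(body, state)            # premise on the loop body
        return sem_while(body, state1)         # premise on the rest of the loop
    return state                               # footnote 3: guard is false


def sem_Start(body, arr):
    """Start -> x = len(arr) - 1; while x>=0 do S."""
    return sem_while(body, {"x": len(arr) - 1, "arr": list(arr)})["arr"]


def num(n):
    """Build the constant n >= 1 as 1 + 1 + ... + 1."""
    term = ("1",)
    for _ in range(n - 1):
        term = ("+", term, ("1",))
    return term


X = ("x",)
body = ("seq",
        ("if_pos", ("store_add", X, X, num(10)), ("store", X, ("0",))),
        ("assign_x", ("-", X, ("1",))))
result = sem_Start(body, [-1, 2, 3, 10, 31, -14, -11])
print(result)  # [0, 12, 13, 20, 41, 0, 0]
```

The recursive call in `sem_while` corresponds to the second relational premise of Eq. (2), and the fall-through case corresponds to the rule for a false guard mentioned in footnote 3.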


SEMGUS allows one to specify multiple such CHCs⁴ for each production in the
grammar. CHCs, which are an intuitive and expressive format, are the logical formalism
of choice for expressing these semantics in a language-agnostic way.

The SEMGUS Framework. Once a user has understood how to define a seman- 
tics for their grammar, a SEMGUS problem then can be specified simply as a 
synthesis problem over a grammar equipped with such a semantics. 


Definition 2 (SEMGUS). A SEMGUS problem over a theory T is a tuple
$sem = (G_{[\![\cdot]\!]}, \psi(x, f(x)))$, where:


— $G_{[\![\cdot]\!]}$ is a regular tree grammar equipped with the semantics $[\![\cdot]\!]$,


* The ability to define multiple semantic rules for a production is useful for productions 
such as while loops, which are commonly equipped with two rules that describe 
looping and loop termination. 
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- w(x, f(x)) is a Boolean formula over the theory T, that serves as the behav- 
ioral specification, 
— f is a free second-order variable that serves as the function to be synthesized. 


A solution to the SEMGUS problem sem is a term s € L(G.) such that 
(a, [s](2)) holds. 


Example 3 (Monica's Synthesis Problem in SEMGUS). Consider the synthesis
problem Monica had in Sect. 1. Let $G_{ex[\![\cdot]\!]}$ be the grammar $G_{ex}$ from
Fig. 1, equipped with semantic rules such as the one defined in Eq. (2). Let
$E = \{[-1, 2, 3, 10, 31, -14, -11]\}$, the input array Monica considered for her task.
Let $\psi(arr, f(arr))$ be a formula over the theory of arrays and CLIA describing
what it means for the program f to be correct on an input arr:

$$\psi(arr, f(arr)) = \bigwedge_{0 \leq i < len(arr)} f(arr)[i] = ITE(arr[i] > 0,\ arr[i] + 10,\ 0).$$

Then $sem_{ex} = (G_{ex[\![\cdot]\!]}, \bigwedge_{arr \in E} \psi(arr, f(arr)))$ is a SEMGUS problem defined
over a background theory of arrays and CLIA; the behavioral specification
requires that the final program satisfies all the examples in E.⁵ Moreover, $sem_{ex}$
is written in a completely logical format, and is thus not tied to a specific tool
like UltraSynth and can be dispatched to multiple backend solvers (assuming
tooling) as Monica pleases.


The ability to customize the semantics for a language in a framework allows 
that framework to support a plethora of different synthesis problems. One can 
define synthesis problems over regular expressions, domain-specific languages, 
imperative languages, or any other language that has a semantics definable as 
CHCs within the framework, all of which can be tested using different solvers 
utilizing different strategies. 


Example 4 (Regular Expressions Synthesis in SEMGUS). Synthesis problems
over regular expressions can be expressed succinctly in SEMGUS. The grammar
of regular expressions can be captured with the following grammar, where
c is a character and $\emptyset$ the empty set:

$$R \to c \mid \epsilon \mid \emptyset \mid R + R \mid R \cdot R \mid R^*$$

Using CHCs, one can also naturally express the semantics of terms $r \in L(R)$. For
example, the semantics of Kleene star can be given as the following two CHCs:

$$\frac{}{\mathit{sem}_R(r^*, \epsilon)} \qquad \frac{\mathit{sem}_R(r, s_1) \quad \mathit{sem}_R(r^*, s_2) \quad s = s_1 \cdot s_2}{\mathit{sem}_R(r^*, s)}$$


5 In this example, one could have used a formula simply describing the input/output 
examples instead of a more complex logical formula. We chose the latter option to 
illustrate how the behavioral specification can involve terms in interesting theories— 
e.g., CLIA and arrays. 




The rules are based on the expansion $r^* \to \epsilon + r \cdot r^*$: the first rule lets $r^*$
accept $\epsilon$, and the second rule accepts a string s by finding two substrings $s_1$, $s_2$,
such that $s_1$ is accepted by r, $s_2$ is accepted by $r^*$, and the concatenation $s_1 \cdot s_2$
is equal to s. The specification of the problem can then use expressions of the
form $\mathit{sem}_R(r, s)$ and $\neg\mathit{sem}_R(r, s)$ to denote whether an example s is positive or
negative, respectively.
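
To make the operational content of these rules concrete, the following Python sketch reads the relation $\mathit{sem}_R(r, s)$ as a recursive membership check. The class names and the function `sem_r` are hypothetical, introduced only for illustration. The sketch also assumes that, in the second Kleene-star rule, the prefix $s_1$ can be required to be non-empty without changing which strings are accepted; that restriction is added here only so that the recursion terminates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    c: str


@dataclass(frozen=True)
class Eps:
    pass


@dataclass(frozen=True)
class Empty:
    pass


@dataclass(frozen=True)
class Alt:
    left: object
    right: object


@dataclass(frozen=True)
class Cat:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    inner: object


def sem_r(r, s: str) -> bool:
    """Return True when the string s is accepted by the regular expression r."""
    if isinstance(r, Char):
        return s == r.c
    if isinstance(r, Eps):
        return s == ""
    if isinstance(r, Empty):
        return False
    if isinstance(r, Alt):
        return sem_r(r.left, s) or sem_r(r.right, s)
    if isinstance(r, Cat):
        # Try every split s = s1 . s2.
        return any(sem_r(r.left, s[:i]) and sem_r(r.right, s[i:])
                   for i in range(len(s) + 1))
    if isinstance(r, Star):
        # First rule: r* accepts the empty string.
        if s == "":
            return True
        # Second rule: s = s1 . s2 with s1 accepted by r and s2 by r*.
        # Assumption: s1 is non-empty, so each recursive call shrinks s.
        return any(sem_r(r.inner, s[:i]) and sem_r(r, s[i:])
                   for i in range(1, len(s) + 1))
    raise TypeError(f"unknown regular expression node: {r!r}")


ab_star = Star(Cat(Char("a"), Char("b")))
positive_ok = sem_r(ab_star, "abab")      # a positive example
negative_ok = not sem_r(ab_star, "aba")   # a negative example
```

A specification with positive and negative examples then amounts to a conjunction of such checks, one per example.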


3.2 Adding Quantitative Syntactic Objectives 


In the example discussed in Sect. 1.1, the original synthesis problem Monica 
posed to the solver was under-constrained and caused the underlying tool to 
synthesize an undesirable solution that contained unnecessary operations. While 
the logical-specification mechanism is powerful, it can only capture the func- 
tional requirements of the synthesis problem—e.g., the program should perform 
correctly on a given set of input/output examples. When multiple possible pro- 
grams can satisfy the specification, a programmable synthesis framework should 
provide a way to prefer one to the other—i.e., the user of the framework should 
be able to describe a quantitative objective. In this section, we show how the 
formal foundations of SEMGUS (i.e., the use of grammars and logic) allow us to 
easily extend the framework to incorporate quantitative objectives over the syn- 
tax of the synthesized program. The ideas we present were originally described 
in the context of SyGuS [12]; here we show how they can also be applied to 
SEMGUS. 


Adding Quantitative Objectives Using Weighted Grammars. Recall that a SEMGUS
problem is given along with a regular tree grammar specifying the search
space. In our running example, Monica would like to synthesize a program that
has few occurrences of the minus operator. A natural way to express this intent
is to allow Monica to tag productions involving such an operator with a cost,
say 1. Our quantitative extension of SEMGUS builds on this intuition and
allows users to add weights/costs to productions in the grammar. This extension
leads to a well-studied formalism, weighted tree grammars, keeping the SEMGUS
framework general. Intuitively, a weighted tree grammar is a grammar in which
each production p has an associated weight/cost $\mu(p)$.

The weight of a derivation tree is the sum of the weights of all the
productions used in it.⁶ For simplicity, in this paper, we assume that the domain of
weights is the natural numbers, and that their sum is the usual application of the
+ operator. We use $w_G(t)$ to denote the weight of a term t with respect to the
weighted grammar G.

With the weights specified by the weighted grammars, users can specify
quantitative objectives as constraint objectives and optimization objectives. A
constraint objective is a predicate $\omega(v)$ over a numerical variable v; we say that
a term t satisfies the constraint objective if $\omega(w_G(t))$ holds. An optimization
objective is a flag OPT ∈ {True, False} indicating whether we want to minimize
the weight of the solution.


6 Weights have to come equipped with operators that tell us how to combine weights
of individual productions to obtain the weights of terms. Formally, the weights must
be from a semiring; we refer the reader to the original work on this topic [12] for
details.


Example 5. Recall that in the example introduced in Sect. 1, Monica wants to
avoid redundant occurrences of the minus (-) operator. To express this intent
in SEMGUS, Monica can use the following weighted grammar.

    Start → x = len(arr) - 1; while x>=0 do S
    S → arr[E] = arr[E] + E | arr[E] = E |
        x = E | S; S | if arr[x]>0 then S else S
    E → 0 | 1 | x | E + E | E - E /1

In the weighted grammar, only the rule E → E - E is assigned the weight 1. All
other rules are assigned the weight 0 (omitted for readability). The weight of a
term t with respect to this grammar is the number of occurrences of the minus
operator in t. If Monica wants to restrict the number of occurrences of the minus
operator to be less than 5, she can use the constraint objective $\omega(v) = v < 5$.
Furthermore, if she wants to minimize the occurrences of the minus operator, she
can set the flag OPT to True.
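
As a small illustration of Example 5, the Python sketch below computes the weight of an expression term and checks the constraint objective "fewer than 5 minus operators". The term encoding (strings for leaves, `(op, left, right)` tuples for binary operators) and the function names are assumptions made for this sketch, not part of the framework.

```python
def weight(term) -> int:
    """Weight of an expression term: the number of minus operators it contains."""
    if isinstance(term, str):
        # Leaves (0, 1, x) come from productions with weight 0.
        return 0
    op, left, right = term
    own = 1 if op == "-" else 0  # only E -> E - E carries weight 1
    return own + weight(left) + weight(right)


def satisfies_constraint(term, bound: int = 5) -> bool:
    """Constraint objective: the weight of the term is strictly below bound."""
    return weight(term) < bound


# The term (x + 1) - (x - 1) contains two minus operators.
term = ("-", ("+", "x", "1"), ("-", "x", "1"))
term_weight = weight(term)
```

Setting OPT to True would additionally ask for a term whose `weight` is as small as possible among all terms that satisfy the specification.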


To summarize, a SEMGUS problem with quantitative syntactic objectives is a
tuple $sem = (W_{[\![\cdot]\!]}, \psi(x, f(x)), \omega, OPT)$ where $W_{[\![\cdot]\!]}$ is a weighted grammar with a
corresponding semantics, $\psi$ is a Boolean formula like before, $\omega$ is the constraint
objective, and the flag OPT is the optimization objective. The goal is to find
a solution that satisfies not only the specification $\psi$ but also the quantitative
objective $\omega$, and is of minimal cost if OPT is set to True.

Quantitative syntactic objectives are useful in applications such as program- 
ming by examples [10] and program repair [6], where it is desirable to produce 
small programs with fewer constants, because such programs are more likely to 
generalize to examples and test cases outside of the set of examples given by the 
user. When allowing real-valued weights, syntactic objectives can be also used 
to find the most likely solution with respect to a given probability distribution. 
We can assign productions weights that represent their probabilities; the weight 
of a candidate solution is its likelihood. 


4 Programmable-Synthesis Solvers 


While a programmable synthesis framework as discussed in Sect. 3 is certainly 
desirable, it is of little practical use if one is unable to solve the problems that are 
written in such a framework. In this section, we show that SEMGUS problems 
can be solved practically. We first describe two general solving techniques for 
SEMGUS (Sect. 4.1) and then present new algorithmic solving techniques enabled 
by the SEMGUS framework (Sect. 4.2). 
4.1 General Solving Procedures for SEMGUS Problems


We start by presenting two solving procedures for general SEMGUS problems,
both of which we implemented as tools and both of which are rooted in strategies
commonly used in existing program synthesizers: enumeration (used in the tool
MESSY-Enum) and constraint solving (used in the tool MESSY). Specifically, we
consider SEMGUS-with-examples problems: SEMGUS problems where the
specification is given in terms of a finite set of examples E. An algorithm for
solving SEMGUS-with-examples problems can be combined with
counterexample-guided inductive synthesis (CEGIS) [27], which generates a
counterexample whenever a synthesized answer does not meet the general
specification, to iteratively enlarge the example set E and eventually obtain a
correct program.
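
The interplay between an example-based solver and CEGIS can be sketched as follows. This is a minimal Python illustration under stated assumptions: `synthesize` stands in for any SEMGUS-with-examples solver, `find_counterexample` stands in for a verifier of the general specification, and the toy candidate space and specification (f(x) = 2x on a small input range) are invented for the example.

```python
def cegis(synthesize, find_counterexample, max_rounds: int = 100):
    """Grow the example set until a candidate passes verification."""
    examples = []
    for _ in range(max_rounds):
        candidate = synthesize(examples)
        if candidate is None:
            return None            # no candidate fits the current examples
        counterexample = find_counterexample(candidate)
        if counterexample is None:
            return candidate       # candidate meets the general specification
        examples.append(counterexample)
    return None                    # round budget exhausted


# Toy search space (hypothetical): three named candidate programs.
CANDIDATES = [
    ("x", lambda x: x),
    ("x+1", lambda x: x + 1),
    ("2*x", lambda x: 2 * x),
]


def synthesize(examples):
    """Return the first candidate consistent with all (input, output) examples."""
    for name, fn in CANDIDATES:
        if all(fn(i) == o for i, o in examples):
            return name, fn
    return None


def find_counterexample(candidate):
    """Check the toy specification f(x) == 2*x on inputs -3..3."""
    _, fn = candidate
    for x in range(-3, 4):
        if fn(x) != 2 * x:
            return (x, 2 * x)
    return None


result = cegis(synthesize, find_counterexample)
```

In this run the first candidate is rejected with one counterexample, and that single example is enough to steer the next synthesis call to a correct program.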


MESSY-Enum: A Basic Enumerator for SEMGUS Problems. Because SEMGUS,
like SYGUS, relies on a grammar to specify the syntax of valid terms, one can
employ a simple enumerator that generates terms of increasing size from the
grammar and tests the enumerated terms against the behavioral specification.
With SEMGUS, a term (representing a program) cannot be executed directly,
because the semantics to ascribe to it is given only in the semantic specification.
However, because the semantics is specified with CHCs, the term can be
executed with a level of interpretation supplied by an off-the-shelf CHC solver.
Therefore, MESSY-Enum employs an off-the-shelf CHC solver such as [18] to
check if the CHCs are consistent with the specification.⁷

Concretely, given a term $t_e$ to test, one can use the following CHC to check
whether $t_e$ meets the specification:

$$\frac{\bigwedge_{\langle e_i, o_i \rangle \in E} \mathit{sem}_{Start}((e_i, t_e), o_i)}{\mathit{Realizable}}\ \text{Query} \tag{4}$$

The Query rule in Eq. (4) exactly encodes the specification as a CHC: it asks
whether the semantics of $t_e$ computed by $\mathit{sem}_{Start}$ is consistent with the set of
input-output examples E. If so, the conclusion Realizable is provable using the
existing set of CHCs, i.e., $t_e$ is a solution to the synthesis problem.
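
The enumerate-and-check loop described above can be sketched in Python as follows. This is an illustration under assumptions, not the MESSY-Enum implementation: it enumerates only the expression fragment E → 0 | 1 | x | E + E | E - E, and the function `meets_examples` replaces the CHC-solver call for the Query rule with a direct Python evaluator, which is possible only because this toy fragment has an obvious executable semantics.

```python
from functools import lru_cache

LEAVES = ("0", "1", "x")
BINARY_OPS = ("+", "-")


@lru_cache(maxsize=None)
def terms_of_size(size: int):
    """All terms with exactly `size` nodes (leaves and operators both count)."""
    if size == 1:
        return LEAVES
    terms = []
    for left_size in range(1, size - 1):
        right_size = size - 1 - left_size
        for op in BINARY_OPS:
            for left in terms_of_size(left_size):
                for right in terms_of_size(right_size):
                    terms.append((op, left, right))
    return tuple(terms)


def evaluate(term, x: int) -> int:
    """Direct evaluator standing in for the CHC-based semantics."""
    if term == "x":
        return x
    if isinstance(term, str):
        return int(term)
    op, left, right = term
    a, b = evaluate(left, x), evaluate(right, x)
    return a + b if op == "+" else a - b


def meets_examples(term, examples) -> bool:
    """Analogue of the Query rule: the term agrees with every example."""
    return all(evaluate(term, i) == o for i, o in examples)


def enumerate_solution(examples, max_size: int = 7):
    """Return the first term, in order of increasing size, that fits the examples."""
    for size in range(1, max_size + 1):
        for term in terms_of_size(size):
            if meets_examples(term, examples):
                return term
    return None


solution = enumerate_solution([(3, 2), (5, 4)])
```

In a real SEMGUS setting each `meets_examples` call would be a query to a CHC solver, which is why the per-candidate cost is much higher than in this sketch.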

Because we cannot directly execute candidate terms and instead rely on CHC 
solvers (which may be treated as a blackbox), it is difficult to employ common 
enumeration optimizations, such as behavioral equivalence caching, or equality 
saturation. Developing an enumeration-based solver capable of utilizing these 
ideas would require generating an explicit and efficiently executable interpreter 
from the given semantics, which is an interesting research challenge and future 
direction that we discuss in Sect. 5. 


7 One can treat CHC solving as akin to a proof search, where the objective is to prove
that a specific query holds (in this case, Realizable from Eq. (4)) using the provided
CHCs.
MESSY: SEMGUS Problem Solving as CHC-Solving. MESSY-Enum uses
a CHC solver to check whether an enumerated term $t_e$ is consistent with the
specification; however, CHC solvers are also capable of automatically searching
for terms that satisfy the specification. Our next solver, MESSY,
takes advantage of this fact by expressing both the syntax of the search space
and the semantics using CHCs. Once the entire search space is modeled this way,
one can then slightly modify the Query rule to accommodate this change and
directly use a CHC solver to solve the entire SEMGUS problem. In essence, MESSY
reduces solving the SEMGUS problem to finding a configuration of variables
for which the set of CHC rules (containing syntax, semantics, and specification)
is valid, similar to how constraint-based methods in existing synthesizers reduce
the synthesis problem to one of solving a set of constraints.


Example 6 (MESSY Encoding). We show how the syntax and semantics used
in the production Start → while x>=0 do S from Fig. 1 can be captured using
CHCs. This production states that one can obtain a syntactically valid term
while x>=0 do s ∈ L(Start) for the nonterminal Start, given a valid term s ∈
L(S). Equation (5) encodes this idea as a CHC using the syntax relations $\mathit{syn}_S$
and $\mathit{syn}_{Start}$, which capture whether the supplied arguments are valid terms that
may be derived from the corresponding nonterminals S and Start.

$$\frac{\mathit{syn}_S(s)}{\mathit{syn}_{Start}(\texttt{while x>=0 do } s)} \tag{5}$$


Because the syntax relations provide a way to guarantee that a term t is a
valid term in the syntax of a SEMGUS problem, one can rewrite the Query rule
from Eq. (4) to use this relation instead of an explicitly enumerated term $t_e$.

$$\frac{\mathit{syn}_{Start}(t) \quad \bigwedge_{\langle e_i, o_i \rangle \in E} \mathit{sem}_{Start}((e_i, t), o_i)}{\mathit{Realizable}}\ \text{Query} \tag{6}$$

The new Query rule in Eq. (6) has the term t as a free variable, i.e., proving
Realizable amounts to finding a term t ∈ L(Start) that is consistent on the
input-output examples. A CHC solver presented with this rule, in tandem with
the syntax and semantic rules, will then attempt to find a configuration of t such
that Realizable holds. If the solver can prove that the premises of Eq. (6) hold,
then the term t is a solution to the SEMGUS problem.


One of the advantages of using such a CHC-based method is when dealing 
with cases where there is no answer to the synthesis problem, i.e., when there 
exists no t such that Realizable holds. In this case, the SEMGUS problem con- 
tains no answer satisfying the specification within its search space; we say that 
such a problem is unrealizable. Proving unrealizability is something that many 
existing solvers fail to consider, but is important: for example, Monica would 
not have had to wait for several hours after modifying the grammar in Sect. 1 if 
her solver had been able to show that the problem was unrealizable. 
4.2 Meta Algorithms for Solving SEMGUS Problems


Now that we have shown how to build solvers for general SEMGUS problems 
(that do not involve quantitative objectives), we turn to ‘meta’-algorithms for 
solving SEMGUS problems, which are ‘meta’ in the sense that they (i) may be 
used atop any general SEMGUS solver, (ii) generate modified SEMGUS problems 
(rather than solutions) that can be easier to solve than the original SEMGUS 
problem or can be used to solve SEMGUS problems with quantitative objectives. 
The key component behind these meta-algorithms is the customizability of the 
search-space description in SEMGUS. 


A Meta Solver for Quantitative Objectives. We first present an algorithm
for solving SEMGUS problems with quantitative objectives [12], i.e., where
productions in the grammar have weights. We assume, for simplicity, that the only
quantitative objective is to find the program of least cost that satisfies the
specification.⁸ The idea of the algorithm is to reduce the SEMGUS problem
with a quantitative objective to a sequence of SEMGUS problems without
quantitative objectives, which are used to iteratively find a solution that has least
cost, i.e., at each step of the sequence the cost of the solution is improved.

The algorithm operates as follows. Initially, we are given a SEMGUS problem
sem with a weighted grammar W (we omit the semantic information for
brevity) and with the minimization objective OPT set to true. The first step of
the algorithm is to construct an unweighted grammar $G^W$ by merely erasing all
the weights in W. We can now use any SEMGUS solver to solve the resulting
SEMGUS problem and obtain a term $t_0$. This term will have a weight c according
to the weighted grammar W, but it might not be the term of least cost that
satisfies the specification. Our algorithm therefore tries to find out whether a
solution with a lower weight exists, and accordingly constructs an (unweighted)
grammar $G^W_{<c}$ such that a term t is accepted by the grammar $G^W_{<c}$ if and only if
the weight of t according to W is less than c. When the weights are natural
numbers, this construction is always possible [12]. We now have again an unweighted
grammar, and we can use a SEMGUS solver to solve the resulting problem. This
procedure can be repeated until no better solution exists.
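
The loop just described can be sketched in a few lines of Python. The sketch assumes a hypothetical callback `solve_below(bound)` that stands for "build the unweighted grammar of terms with weight below `bound` (or erase the weights when `bound` is `None`) and hand the resulting problem to any SEMGUS solver"; it returns a solution or `None` when the restricted problem is unrealizable. The toy solution pool is invented for the example.

```python
def minimize_weight(solve_below, weight):
    """Repeatedly tighten the weight bound until no better solution exists."""
    best = solve_below(None)                 # weights erased: any solution will do
    while best is not None:
        better = solve_below(weight(best))   # search only terms of weight < c
        if better is None:
            return best                      # restricted problem unrealizable: best is optimal
        best = better
    return None                              # the original problem has no solution


# Toy stand-in for a solver: strings that all "satisfy the specification".
POOL = ["a-b-c-d", "a-b", "a+b-c", "a"]


def count_minus(term: str) -> int:
    return term.count("-")


def solve_below(bound):
    """Return the first pooled solution whose weight is below the bound."""
    for term in POOL:
        if bound is None or count_minus(term) < bound:
            return term
    return None


optimal = minimize_weight(solve_below, count_minus)
```

Because weights are natural numbers and strictly decrease from one round to the next, the loop makes at most c + 1 solver calls after the first solution of weight c is found.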


Example 7. Consider the weighted grammar W we presented in Example 5. In
particular, let us focus our attention on the following subset of productions that
involve non-zero weights:

    E → 0 | 1 | x | E + E | E - E /1

The grammar $G^W_{<3}$, which accepts all terms of weight less than 3, is as follows:

    E → E_2 | E_1 | E_0
    E_2 → E_1 - E_0 | E_0 - E_1
    E_1 → E_0 - E_0
    E_0 → 0 | 1 | x | E_0 + E_0

Intuitively, each nonterminal $E_i$ produces all and only terms with exactly i
minus operators.


8 For simplicity, we assume no further quantitative objectives are present, but the
general case can be handled using similar ideas [12].
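
The construction in Example 7 can be generated mechanically. The Python sketch below builds productions for nonterminals E_0, ..., E_{c-1} for a given bound c, under the assumption that E_i should derive every term with exactly i minus operators. Under that assumption the sketch also emits productions such as E_1 → E_0 + E_1, which the grammar displayed in Example 7 does not list; the function name and the string encoding of productions are likewise assumptions of this sketch.

```python
def bounded_weight_grammar(bound: int):
    """Productions of an unweighted grammar for terms with fewer than `bound` minuses."""
    rules = {"E": [f"E_{i}" for i in reversed(range(bound))]}
    for i in range(bound):
        productions = []
        if i == 0:
            productions += ["0", "1", "x"]           # leaves contain no minus
        for j in range(i + 1):
            # A sum has i minuses when its operands have j and i - j.
            productions.append(f"E_{j} + E_{i - j}")
        for j in range(i):
            # A difference contributes one minus itself: j + (i - 1 - j) + 1 = i.
            productions.append(f"E_{j} - E_{i - 1 - j}")
        rules[f"E_{i}"] = productions
    return rules


grammar = bounded_weight_grammar(3)
for nonterminal, productions in grammar.items():
    print(nonterminal, "->", " | ".join(productions))
```

The number of nonterminals grows linearly with the bound, so each refinement step of the meta solver yields a grammar of modest size when the weights are small.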


The meta solver for quantitative objectives shows how using a solver-agnostic 
specification formalism—i.e., grammars—enables algorithms that operate at the 
specification level and can be reused across multiple solvers. 


Underapproximating Semantics with SEMGUS. The previous section
showed how the programmability of the search-space syntax (i.e., the gram- 
mar) allows us to design meta-algorithms to solve SEMGUS problems involving 
quantitative objectives. In this section, we show how the programmability of 
the search-space semantics can be used to build meta-algorithms that can make 
synthesis faster. The key idea is to generate “simpler” variants of the original 
SEMGUS problem that use an underapproximating semantics, where an under- 
approximating semantics is defined as a subset of the original semantics that 
must be precise on the subset on which it is defined. 


Definition 3. For a grammar G equipped with a semantics $[\![\cdot]\!]$, we say $[\![\cdot]\!]^\flat$
underapproximates $[\![\cdot]\!]$ on G, or that $[\![\cdot]\!]^\flat$ is an underapproximating semantics
for G with respect to $[\![\cdot]\!]$, if for every term $t \in L(G)$, every state $\Gamma$, and every
value v on which $[\![\cdot]\!]^\flat$ is defined, $[\![t]\!]^\flat(\Gamma, v) = [\![t]\!](\Gamma, v)$.


One easy way to underapproximate a semantics is to simply “eliminate” cer- 
tain operators from a grammar by not defining semantic rules for them. How- 
ever, the concept of underapproximation need not be bounded to eliminating 
operators from a grammar—it may have a fully semantic meaning instead, for 
example, a bound on the number of possible loop iterations. The key intuition is 
that underapproximation is sound for use in synthesis—if a term t is the answer 
to a synthesis problem sy, sy actually does not need to contain any syntax or 
semantics outside of what is used to define and compute t. (In contrast, overap- 
proximation is sound for proving unrealizability.) 


Example 8. Recall, once again, the synthesis problem Monica has in Sect. 1. The 
grammar Gex of Fig. 1 contains a while loop, which has a complex semantics 
that can be expensive to compute and, most importantly, allows nonterminating 
behavior. Most existing synthesizers [27,28] explicitly prohibit nontermination 
by only considering finitely many unrollings for loops (because most answers to 
a synthesis problem will indeed terminate). 

Fortunately, Monica knows that on her example [-1, 2, 3, 10, 31, -14, -11],
the loop should iterate no more than 7 times to process every element of the array. 
Monica may then choose to supply the synthesizer with an underapproximating 
semantics that limits the number of loop iterations to 7, which could greatly 
reduce the amount of computation the synthesizer must perform—for example, 
a naive enumerator might get stuck on a nonterminating loop when using the 
precise semantics, while terminating quickly when using the underapproximating 
semantics. Such a semantics can be expressed easily by adding a loop counter c 
to the semantics of loops given in Eq. (2), yielding the following CHC: 




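$$\frac{c > 0 \quad \Gamma'[x] \geq 0 \quad \mathit{sem}_S((s, \Gamma', c-1), \Gamma_1) \quad \mathit{sem}_{Start}((\texttt{while x>=0 do } s, \Gamma_1, c-1), \Gamma_2)}{\mathit{sem}_{Start}((\texttt{while x>=0 do } s, \Gamma', c), \Gamma_2)} \tag{7}$$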


Setting c = 7 in the Query rule now ensures that loops run at most 7 iterations. 
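
Read operationally, Eq. (7) describes a loop evaluator with an iteration budget. The Python sketch below mirrors that reading on a toy state (a dictionary from variable names to integers). The loop-exit rule, the `run_body` callback, and the use of `None` for "no rule applies" are assumptions made for illustration; they do not appear in this section.

```python
def sem_start_bounded(run_body, state, c: int):
    """Final state of `while x>=0 do s` with at most c iterations, or None."""
    if state["x"] < 0:
        # Assumed exit rule: the guard fails and the state is returned unchanged.
        return state
    if c <= 0:
        # Budget exhausted: the underapproximating semantics is undefined here.
        return None
    state_after_body = run_body(state)                          # premise on sem_S
    return sem_start_bounded(run_body, state_after_body, c - 1)  # premise on sem_Start


def decrement_x(state):
    """A loop body that moves x one step toward termination."""
    new_state = dict(state)
    new_state["x"] -= 1
    return new_state


def do_nothing(state):
    """A loop body that never changes x, so the precise loop would not terminate."""
    return state


terminating = sem_start_bounded(decrement_x, {"x": 6}, 7)
diverging = sem_start_bounded(do_nothing, {"x": 6}, 7)
```

With the budget set to 7, the first call completes all seven iterations over indices 6 down to 0, while the second call is cut off instead of looping forever.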


Abstract Semantics with SEMGUS. Similar to how we used underapproximating
semantics to find solutions to a SEMGUS problem, abstract (overapproximating)
semantics can be used to prove that a SEMGUS problem is unrealizable.


Definition 4. For a grammar G equipped with a semantics $[\![\cdot]\!]$, we say $[\![\cdot]\!]^\sharp$ is an
abstract semantics for G with respect to $[\![\cdot]\!]$ if there exists an abstraction function
$\alpha$ and a concretization function $\gamma$, such that for all $t \in L(G)$, if $[\![t]\!](\Gamma, v)$ holds,
then $[\![t]\!]^\sharp(\alpha(\Gamma), \alpha(v))$ holds, and $\Gamma \in \gamma(\alpha(\Gamma))$, $v \in \gamma(\alpha(v))$, i.e., $\alpha$ and $\gamma$ form
a Galois connection.


In contrast to underapproximating semantics, abstract semantics are sound 
when used to prove unrealizability—i.e., that a synthesis problem has no solution 
that satisfies its specification within its search space. Consider the use of abstract 
interpretation in program analysis: abstract interpretation is most often used to 
prove that a program cannot reach a certain set of bad states, while often being 
unable to guarantee that a program will produce a specific value, due to lack of 
precision. Similarly, an abstract semantics will often be unable to guarantee that 
a synthesized program satisfies the specification, due to lack of precision—but it 
can guarantee that all programs in the search space will never be able to produce 
a certain set of values, which can be used to prove unrealizability. 


Example 9. Consider the scenario from Sect. 1, in which Monica removed sub- 
traction from her grammar in an attempt to simplify the synthesized program. 
The removal of subtraction made the problem unrealizable—and UltraSynth ran 
for hours on end because it could not prove that this was the case. While prov- 
ing unrealizability can be very difficult in general, a solver capable of reasoning 
about abstract domains and semantics could have utilized an (abstract) semantic 
rule such as Eq. (8) below: 


$$\frac{\mathit{sem}_E((e_1, \Gamma'), \{pos\}) \quad \mathit{sem}_E((e_2, \Gamma'), \{pos\})}{\mathit{sem}_E((e_1 + e_2, \Gamma'), \{pos\})} \tag{8}$$


Equation (8) is defined on the abstract domain {pos, zero, neg}, corresponding
to positive, zero, and negative values, and captures the fact that the sum of two
positive numbers will always be positive. This rule will be able to prove that $G_{ex}$
without subtraction will never be able to modify x to a negative value, and thus
that no program in the search space will terminate (leading to unrealizability).
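
The argument in Example 9 can be replayed as a small fixed-point computation over the sign domain. The Python sketch below is an illustration under assumptions: only the pos + pos case comes from Eq. (8), the remaining cases of `abstract_add` follow the usual sign abstraction, and the leaves 0, 1, and x are assumed to be non-negative (x starts at len(arr) - 1 for a non-empty array).

```python
ALL_SIGNS = {"pos", "zero", "neg"}


def abstract_add(a: str, b: str):
    """Set of possible signs of a sum, given the signs of its operands."""
    if a == "zero":
        return {b}
    if b == "zero":
        return {a}
    if a == b:
        return {a}            # pos + pos = pos (Eq. (8)); neg + neg = neg
    return set(ALL_SIGNS)     # pos + neg can have any sign


def reachable_signs(leaf_signs):
    """Signs of all values derivable from the leaves using only addition."""
    reached = set(leaf_signs)
    changed = True
    while changed:
        changed = False
        for a in list(reached):
            for b in list(reached):
                new_signs = abstract_add(a, b) - reached
                if new_signs:
                    reached |= new_signs
                    changed = True
    return reached


# Grammar without subtraction: leaves 0 (zero), 1 (pos), and x (assumed zero or pos).
signs_without_minus = reachable_signs({"zero", "pos"})
can_become_negative = "neg" in signs_without_minus
```

Since `neg` is never reached, no assignment x = E can make the guard x>=0 fail, which is the abstract fact behind the unrealizability claim.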


Unrealizability is a property that is ignored by many current synthesizers, 
but it is a very important property nonetheless. One practical way to think about 
unrealizability is as a sanity check, like a type system: the fact that a synthesis
problem provided by an end user is unrealizable means that the synthesis
problem is malformed in the sense that the user has got some of their specifi- 
cations wrong. Similar scenarios happen daily with ordinary programming, and 
we expect them to happen with synthesis as well—thus, it is desirable that syn- 
thesizers be able to detect these problems, and report them early on if possible, 
without running indefinitely, as in Sect. 1. Unrealizability also has applications 
in computing optimal solutions, as in Sect. 4.2: unrealizability given a grammar 
with a lower weight bound ensures that the current solution is optimal. 


5 The Future of Programmable Synthesis and SEMGUS


We hope we have convinced the reader that synthesis could use more programma- 
bility, and that SEMGUS addresses many of the programmability issues of exist- 
ing synthesis work. But what lies ahead? How can we make programmable syn- 
thesis truly practical? In this section, we first outline some of the steps we are 
undertaking to answer this question (Sect. 5.1). 

More importantly, we would also like to emphasize that the vision of pro- 
grammable program synthesis can only be realized through a community effort. 
We will conclude this section with ideas to involve the synthesis community to 
help us realize our vision (Sect. 5.2). 


5.1 What Are We Working on Next? 


In this section, we present some of the directions our group is pursuing in extend- 
ing SEMGUS to richer objectives and building better solvers for it. We also 
describe some open problems related to SEMGUS. 


Interfacing Existing Program Synthesizers with SEMGUS. The bulk of our dis- 
cussion in Sect. 3 was about achieving domain-agnosticity by building upon the 
ideas that SYGUS used in achieving solver-agnosticity. However, there also exist 
synthesis tools that are already domain-agnostic; most notably, solver-aided lan- 
guages such as Sketch [27], Rosette [28], MiniKanren [8], and Prose [25]. While 
these tools are not solver-agnostic, they can in principle be used as SEMGUS 
solvers by virtue of their domain-agnosticity. 

To use such existing tools as SEMGUS solvers, one must develop a compiler 
of sorts to translate a SEMGUS problem (written in the logical format from 
Sect. 3) to the specific front-end language of the tool. This task is not trivial 
for a number of reasons. First, each of these tools implements restrictions on the
types of synthesis problems it accepts; these restrictions are what enable its
fast algorithms. For example, Rosette, Sketch, and MiniKanren only support
finite search spaces (i.e., finite grammars), and this fact is encoded in different 
ways for different tools (e.g., by imposing bounds on the search depth or by 
imposing syntactic bounds on the search space). Second, some of these solvers 
implicitly use limited semantics—e.g., Sketch limits how many times a loop can 
be executed. Third, some of these solvers require special inputs that are useful to
guide the synthesis engine—e.g., Prose requires the user to provide a semantics 
for each operator in the input language as well as an inverse semantics that 
can be executed backwards; the inverse semantics is used to perform efficient 
enumeration. 

Soundly compiling SEMGUS problems to these tools requires one to modify 
the original problems to fit these restrictions. Thankfully, the flexibility of SEM- 
GUS comes to our aid! In Sect. 4.2, we have described ideas for transforming 
SEMGUS problems using restricted grammars or underapproximating semantics. 
These transformations are sound for synthesis—i.e., a solution to the transformed 
synthesis problem, which satisfies the restrictions of a particular external tool, is 
still a solution to the original problem—and thus can be used to interface with 
external tools. We are currently working on automating such translations. 

The case of Prose is particularly interesting in that it requires inverse seman- 
tics, which are not immediately available from a SEMGUS problem. However, 
because SEMGUS semantics are expressed logically as CHCs, one can automati- 
cally invert these semantics starting from the CHCs—we are currently developing 
a tool that performs this inversion automatically and uses the inverse semantics 
to interface with Prose. 

Other more specialized solvers, such as those for synthesizing regular expres- 
sions [23], could also be interfaced with our framework, with the limitation that 
they will only be able to handle specific problems. The more general question 
here is: how can we determine whether a specific SEMGUS problem is com- 
patible with a specialized solver? We are working on designing “theories” that 
describe specific semantics for which specialized solvers exist. For example, if one 
were to use SEMGUS to work with regular expressions, they could import the 
regular-expression theory, which by design would enable compatibility with cer- 
tain solvers. Note that this approach is still solver-agnostic because any general 
SEMGUS solver would still be able to use this problem definition. 


Lifting Existing Synthesis Algorithms to Work with SEMGUS. While interfacing
existing synthesizers with SEMGUS is one straightforward way of creating
SEMGUS solvers, we envision that higher efficiency can be achieved by designing 
solvers that take advantage of the structure of SEMGUS problems. Is it possible 
to lift algorithms (not tools) that have previously been successful with SYGUS 
or other synthesizers up into SEMGUS? 

For example, consider the problem of building an efficient enumeration algo- 
rithm for SEMGUS, an algorithmic technique that is now successfully employed 
in most SYGUS solvers [2,4,21]. The success of enumeration has been driven by 
a number of clever ideas for efficiently pruning the search space of relevant pro- 
grams. An example was mentioned in Sect. 4.1, where we discussed the challenges 
with employing strategies such as behavioral equivalence caching or equality sat- 
uration on SEMGUS due to the lack of an executable semantics—i.e., in SEM- 
GUS, evaluating a term on an input requires a costly call to a CHC solver. We 
are currently building an enumeration algorithm for SEMGUS that addresses 
this limitation. Our algorithm first synthesizes an executable interpreter from 
the SEMGUS problem semantics, and then uses this executable interpreter to
guide the search. To scale, our approach must handle other challenges, which we 
are also working on—e.g., discovering which operators have a semantics that is 
associative or commutative can help us avoid enumerating equivalent terms. 

While the generality of SEMGUS is an obstacle to adapting some well-known 
algorithms, the same generality also helps SEMGUS provide a natural inter- 
face to express other algorithms, such as program synthesis using abstraction 
refinement [29]. The approach taken here is to synthesize programs that work 
on an abstract domain, and repeatedly refine the abstract domain until a pro- 
gram is found that is correct under the concrete semantics. This approach, in 
a sense, uses a meta-algorithm that can be expressed naturally in SEMGUS, 
as discussed in Sect. 4.2. We believe that SEMGUS will naturally be able to 
express many such meta-algorithms, and further accelerate the development of 
new meta-algorithms. 


Supporting Richer Specifications. Beyond the basic specification mechanisms, 
SEMGUS already supports syntactic quantitative objectives through weighted 
grammars (Sect. 3.2). To capture the breadth of specifications appearing in 
modern synthesis applications, the SEMGUS framework will have to evolve over 
time. While we are investigating a number of complex objectives that will require 
extensions to the framework (e.g., probabilistic specifications), in the following 
paragraph we describe a specification mechanism the current SEMGUS frame- 
work can already capture for free: types. 

Consider the problem of synthesizing a program that meets a given time 
complexity (or asymptotic resource usage in general) [11,16]. In existing work, 
such bounds are specified (and proven correct) using a dependent type system. 
The solver uses the type system to guide the search, by enumerating only terms 
that satisfy a certain type. We observe that the SEMGUS framework is already 
able to capture such type-based specifications! In particular, types are a form 
of static semantics that can be associated with terms and, in most cases, typing 
rules can be encoded as CHCs, similarly to how one encodes semantic rules. For 
example, the following dependent type rule can be captured using a CHC where 
each typing judgment t : type is described using a relation r(t, type). 


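\[
\frac{a : \{\mathsf{Int} \mid \varphi_a(v)\} \qquad b : \{\mathsf{Int} \mid \varphi_b(v)\} \qquad + : x{:}\mathsf{Int} \to y{:}\mathsf{Int} \to \{\mathsf{Int} \mid v = x + y\}}
     {a + b : \{\mathsf{Int} \mid v = x + y \land \varphi_a(x) \land \varphi_b(y)\}} \tag{9}
\]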
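To make the encoding of rule (9) more concrete, here is a minimal Python sketch. It is not the SEMGUS encoding itself: it assumes that a refinement type {Int | phi(v)} is modelled by a Python predicate phi, that the bound variables x and y in the conclusion are read existentially, and that integers range over a small finite DOMAIN. The names DOMAIN and r_plus are hypothetical.

```python
from itertools import product

# Assumption: a small finite slice of Int, used only to make the check executable.
DOMAIN = range(-5, 6)

def r_plus(phi_a, phi_b):
    """Head of a Horn clause for rule (9).

    Body: r(a, {Int | phi_a}) and r(b, {Int | phi_b})
    Head: r(a + b, {Int | v = x + y and phi_a(x) and phi_b(y)})
    The variables x and y are existentially quantified over DOMAIN.
    """
    def phi_sum(v):
        return any(v == x + y and phi_a(x) and phi_b(y)
                   for x, y in product(DOMAIN, DOMAIN))
    return phi_sum

def is_even(v):
    return v % 2 == 0

def is_positive(v):
    return v > 0

# Refinement of "a + b" when a is even and b is positive.
phi = r_plus(is_even, is_positive)
```

With these definitions, phi(3) holds (take x = 2 and y = 1), while phi(-10) does not, since no pair in DOMAIN sums to -10.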


5.2 What Can the Synthesis Community Do? 
As we mentioned at the beginning of this section: 


The vision of programmable program synthesis can only be realized through a 
community effort. 


We discuss problems the community can help with in this concluding section. 
A Broader Scope for Synthesis. The scope and potential of synthesis are very
broad, in fact even broader than what has been discussed in this paper. An
invited paper by Gulwani [9] began:


Program Synthesis is the task of discovering an executable program from 
user intent expressed in the form of some constraints. 


However, we feel that this viewpoint is actually somewhat narrow. We believe 
that insight on many problems can be obtained via the “lens” of synthesis: for 
many computing tasks, the goal is to produce some artifact to which some seman- 
tics is attached, and the process of producing that artifact can be thought of as 
a synthesis problem. For instance, in an AI planning problem, the artifact is a 
plan—i.e., Monica from Sect. 1 is a robot, and the sought-for program must navi- 
gate her from point A to point B (e.g., minimizing power consumption and time, 
while avoiding collisions and satisfying other safety guarantees). Closer to home 
for the CAV community, inside many tools for statically checking assertions in 
programs (such as SLAM or BLAST), the key component is one that creates 
an abstracted model of the program that is sufficiently precise to show that an 
assertion violation is not possible. Among the artifacts that may need to be syn- 
thesized are inductive invariants, abstract transformers, function summaries, and 
interface specifications. Thus, we conclude by offering the following wider defini- 
tion of synthesis, which connects this broader outlook with the semantics-based 
perspective that we have presented in this paper: 


Synthesis is the task of discovering a syntactic object—selected from 
some formalism in which each syntactic object has a rigorously defined 
semantics—from an “intent” expressed in the form of some kind of con- 
straint. 


We believe that the issues discussed in this paper will be increasingly important 
if synthesis is to be applied successfully to the creation of artifacts that have 
semantics, but are not programs per se. 

The generality of our framework can bridge the gap between the many appli- 
cations of synthesis, and we hope that the community will engage in our work 
by modeling their synthesis problems in SEMGUS, and by adapting their solvers 
to work with SEMGUS. Such contributions will result in new benchmarks and 
solvers, contributing to the programmability and effectiveness of SEMGUS. 


Standardization and Competitions. We believe that the idea of a programmable 
synthesis framework, and SEMGUS, the start of such a framework, represents 
a step forward in program synthesis. Similarly to what happened with SyGuS, 
SEMGUS must be standardized, other researchers should build solvers for it, and 
these solvers should compete annually in SEMGUS competitions. 

We hope that this paper will encourage readers to experiment with and 
advance the ideas presented here, in three ways: First, we hope that the generality 
of the framework will make it easy for people to use it on various problems, which 
in turn will make it easy to collect large and diverse sets of benchmarks that 
will make the design of new solvers focused and effective. Second, we hope that
researchers will build new algorithms and techniques that are general and can 
solve problems built in this framework. Third, we hope to soon create a yearly 
competition that will foster further interest in building general synthesizers for 
our framework. More than anything, this paper is a call-to-arms—an invitation 
to help broaden the scope and abilities of program synthesis, toward an era where 
Monica uses synthesizers just as much as Python during her daily work. 
Abstract. This paper presents the main ideas behind deductive synthesis
of heap-manipulating programs and outlines present challenges faced
by this approach as well as future opportunities for its applications.


1 Introduction 


Just like a journey of a thousand miles begins with a single step, an imple- 
mentation of a working operating system, cryptographic library, or a compiler 
begins with writing a single function. This is not quite so for verified software, 
whose development starts with three “steps”: a function specification (or, spec), 
followed by its implementation, and then by a proof that the implementation 
satisfies the spec. Although recent years have seen an explosion of increasingly 
diverse and sophisticated verified systems [14,20,26,31,41,48,71,73,96], their
cost remains high, owing to the effort required to write formal specifications and 
proofs in addition to writing the code. 

The good news is that in many cases the aforementioned three steps can be
replaced by just one of them: writing the spec. The rest can be delegated to
deductive program synthesis [52], an emerging approach to automated software
development, which takes as input a specification and searches for a
corresponding program together with its proof.

Past approaches to deductive synthesis typically avoided low-level programs 
with pointers [43,69,83], which are notoriously difficult to reason about, making 
these approaches inapplicable to automating the development of verified systems 
code. The few techniques that did handle the heap [47,72] had significant limita- 
tions in terms of expressiveness and/or efficiency. Our prior work on the SUSLIK 
synthesizer [70] has introduced an alternative approach to synthesis of
pointer-manipulating programs, whose key enabling component is the use of Separation
Logic (SL) [66,75] as the specification formalism. Due to its proof scalability, 
Separation Logic enabled modular verification of low-level imperative code and 
has been implemented in a large number of automated and interactive program 
verifiers [4,7,18,37,57,62,64,68]. The main novelty of SUSLIK was an observa- 
tion that the structure of SL specifications can be used to efficiently guide the 
search for a program and its proof. Since then, our follow-up work has explored 
automatic discovery of recursive auxiliary functions [34], generating indepen- 
dently checkable proof certificates for synthesized programs [93], and giving the 
user more control over the synthesis using concise mutability annotations [19]. 

As an appetizer for SL-powered deductive program synthesis, consider the
problem of flattening a binary tree data structure into a doubly-linked list.
Assume also that the programmer would prefer to perform this transforma- 
tion in-place, without allocating new memory, which they conjecture is possible 
because the nodes of the two data structures have the same size (both are records 
with a payload and two pointers). With SUSLIK, the programmer can describe 
this transformation using the following Hoare-style SL specification: 


{tree(x, S)} flatten (loc x) {dll(x,y,S)} (1) 


Here the precondition asserts that initially x points to the root of a tree, whose
contents are captured by a set S. The postcondition asserts that after the
execution of flatten, the same location x is a head of a doubly-linked list, with the same
elements S as the initial tree (y denotes the existentially quantified back-pointer
of the list). The definitions of the two predicates, tree and dll, which constrain
the symbolic heaps in the pre- and postcondition are standard for SL [75] and will
be shown in Sect. 2.

Given the spec (1), SUSLIK takes less than 20s to generate the program in
Fig. 1, written in a core C-like language, as well as a formal proof that the
program satisfies the spec. Several things are noteworthy about this program. First, the
code indeed does not perform any allocation, and instead accomplishes its goal
by switching pointers (in lines 17, 18, 23, and 25); this makes it economical in
terms of memory usage as only a low-level program can be: similar code written in
a functional language like OCaml would inevitably rely on garbage collection.
Second, the main function flatten relies on an auxiliary recursive function helper,
which the programmer did not anticipate; in fact the need for this auxiliary, and its
specification, is discovered by SUSLIK completely automatically. All the programmer
has to do to obtain a provably correct implementation of flatten is to write the
spec (1) and define the two SL predicates it uses, which are, however, reusable
across different programs.

     1  flatten(loc x) {
     2    if (x == 0) {
     3    } else {
     4      let l = *(x + 1);
     5      let r = *(x + 2);
     6      flatten(l);
     7      flatten(r);
     8      helper(r, l, x);
     9    }
    10  }
    11
    12  helper(loc r, loc l,
    13         loc x) {
    14    if (l == 0) {
    15      if (r == 0) {
    16      } else {
    17        *(r + 2) = x;
    18        *(x + 1) = r;
    19      }
    20    } else {
    21
    22      let w = *(l + 1);
    23      *(l + 2) = r;
    24      helper(r, w, l);
    25      *(l + 2) = x;
    26    }
    27  }

Fig. 1. Flattening a tree into a DLL.
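As a sanity check of the in-place behaviour described above, the pointer operations of Fig. 1 can be replayed on a toy heap. This is a minimal Python sketch, not SUSLIK output: it assumes the heap is a dictionary from integer addresses to values, that each node occupies three consecutive cells (payload, then two pointers), and that 0 is the null pointer. The helper dll_payloads is a hypothetical checker added for illustration.

```python
def flatten(heap, x):
    """Flatten the tree rooted at x into a doubly-linked list, in place."""
    if x == 0:
        return
    l = heap[x + 1]
    r = heap[x + 2]
    flatten(heap, l)
    flatten(heap, r)
    helper(heap, r, l, x)

def helper(heap, r, l, x):
    """Append list r to the end of list l, with x as the node preceding l."""
    if l == 0:
        if r != 0:
            heap[r + 2] = x      # back-pointer of r now points to x
            heap[x + 1] = r      # x is followed by r
    else:
        w = heap[l + 1]
        heap[l + 2] = r          # temporary value expected by the recursive call
        helper(heap, r, w, l)
        heap[l + 2] = x          # final back-pointer of l

def dll_payloads(heap, x):
    """Walk the list from x, checking back-pointers of all non-head nodes."""
    payloads, prev = [], None
    while x != 0:
        if prev is not None and heap[x + 2] != prev:
            raise ValueError("broken back-pointer")
        payloads.append(heap[x])
        prev, x = x, heap[x + 1]
    return payloads

# A five-node tree: 1 has children 2 and 3; 2 has children 4 and 5.
heap = {10: 1, 11: 20, 12: 30,
        20: 2, 21: 40, 22: 50,
        30: 3, 31: 0, 32: 0,
        40: 4, 41: 0, 42: 0,
        50: 5, 51: 0, 52: 0}
cells_before = set(heap)
flatten(heap, 10)
```

After the call, dll_payloads(heap, 10) visits every payload of the original tree exactly once, and the set of heap cells is unchanged, which matches the claim that no memory is allocated.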

At this point, a critical reader might be wondering whether this technology is
mature enough to move past hand-crafted benchmarks and assist them in developing
the next COMPCERT [48] or CERTIKOS [31]. For one, the program in Fig. 1 does not
seem optimal: a closer look reveals that the role of helper is to concatenate the lists
obtained by flattening the two subtrees, resulting in the overall $O(n^2)$ complexity with
respect to the size of the original tree.¹ Apart from performance of synthesized programs, the
reader might have the following concerns:


– What is the class of programs this approach is fundamentally capable of
  synthesizing? How picky is it about the exact shape of input specifications?
– Is the proof search predictably fast across a wide range of problems?
– Will the synthesized code be concise and easy to understand?
– Finally, what are the “killer apps” for this technology and in which domains can
  we hope for its adoption for practical needs?


The goal of this manuscript is precisely to illustrate the remaining challenges in
SL-based synthesis of heap-manipulating programs and outline some future research
directions towards addressing these challenges. In the remainder of this paper we
provide the necessary background and a survey of the results to date (Sect. 2); we then
zoom in on the promising techniques for improving proof search (Sect. 3); in Sect. 4
we discuss the completeness of synthesis, outlining the work that needs to be done in
order to formally characterize the class of programs that can and cannot be generated;
in Sect. 5 we talk about possible extensions to the synthesis procedure for improving
the quality of synthesized programs; finally, in Sect. 6 we discuss possible
applications of SL-based synthesis, such as program repair, data migration, and concurrent
programming.


2 State of the Art 


2.1 Specifications 


SUSLIK takes as input a Hoare-style specification, i.e., a pair of a pre- and a
postcondition. Consider, for example, a specification for a function swap,² which swaps the
values of two pointers:

\[
\{x \mapsto a * y \mapsto b\}\ \ \texttt{swap(loc x, loc y)}\ \ \{x \mapsto b * y \mapsto a\} \tag{2}
\]

The precondition $x \mapsto a * y \mapsto b$ states that the relevant part of the heap contains two
memory locations, x and y, which store values a and b, respectively. We also know
that $x \neq y$, because the semantics of separating conjunction ($*$) requires that the
two heaps it connects be disjoint. The postcondition $x \mapsto b * y \mapsto a$ demands that after


¹ In Sect. 4 we show what it takes to derive an alternative, linear-time solution.
² Our language has no return statement, hence all functions have return type void,
which is omitted from the spec; return values are emulated by writing to the heap.


Deductive Synthesis of Programs with Pointers 113 


executing the function, the values stored in x and y be swapped. This specification 
also implicitly guarantees that swap always terminates and executes without memory 
errors (e.g., null-pointer dereferencing). Note that x and y also appear as parameters 
to swap, and hence are program variables, i.e., can be mentioned in the synthesized 
program; the payloads a and b, on the other hand, are logical variables, implicitly 
universally quantified, and must not appear in the program. In the rest of this paper, 
we distinguish program variables from logical variables by using monotype font for the 
former. 

In general, in a specification $\{\mathcal{P}\}\ f(\ldots)\ \{\mathcal{Q}\}$, assertions $\mathcal{P}$, $\mathcal{Q}$ both have the form
$\phi; P$, where the spatial part $P$ describes the shape of the heap, while the pure part
$\phi$ is a plain first-order formula that states the relations between variables (in (2) the
pure part in both pre- and postcondition is trivially true, and hence omitted). For
the spatial part, SUSLIK employs the standard symbolic heap fragment of Separation
Logic [66,75]. Informally, a symbolic heap is a set of atomic formulas called heaplets
joined with separating conjunction ($*$). The simplest kind of heaplet is a points-to
assertion $x \mapsto e$, describing a single memory location with address x and payload e. An
empty symbolic heap is represented with emp.
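The role of disjointness in separating conjunction can be illustrated on concrete heaps. The following Python sketch is an illustration only: it assumes a concrete heap is a dictionary from addresses to values, and the names points_to and sep_conj are hypothetical. The body of swap is the straightforward four-statement implementation that specification (2) admits.

```python
def points_to(x, e):
    """Concrete heap described by a single points-to heaplet: address x stores e."""
    return {x: e}

def sep_conj(h1, h2):
    """Separating conjunction on concrete heaps: defined only for disjoint domains."""
    if h1.keys() & h2.keys():
        raise ValueError("heaps overlap")
    return {**h1, **h2}

def swap(heap, x, y):
    """Exchange the values stored at addresses x and y."""
    a1 = heap[x]
    b1 = heap[y]
    heap[x] = b1
    heap[y] = a1

# Precondition of (2) with x = 100, y = 200, a = 7, b = 9.
heap = sep_conj(points_to(100, 7), points_to(200, 9))
swap(heap, 100, 200)
```

After the call, heap equals sep_conj(points_to(100, 9), points_to(200, 7)), which is the postcondition of (2) for these values. Calling sep_conj on two heaplets with the same address raises an error, reflecting the fact that the precondition implies x and y are distinct.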

To capture linked data structures, such as lists and trees, SUSLIK specifications 
use inductive heap predicates, which are standard in Separation Logic. For instance, 
the tree predicate used in (1) is inductively defined as follows: 


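\[
\begin{array}{rcl}
\mathsf{tree}(x, S) & \triangleq & x = 0 \Rightarrow \{S = \emptyset;\ \mathsf{emp}\} \\
 & \mid & x \neq 0 \Rightarrow \{S = \{v\} \cup S_l \cup S_r; \\
 & & \quad [x, 3] * x \mapsto v * (x, 1) \mapsto l * (x, 2) \mapsto r * \mathsf{tree}(l, S_l) * \mathsf{tree}(r, S_r)\}
\end{array} \tag{3}
\]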


The predicate is parametrized by the root pointer x and the set of tree elements S. This
definition consists of two guarded clauses: the first one describes the empty tree (and
applies when the root pointer is null), and the second one describes a non-empty tree.
In the second clause, a tree node is represented by a three-element record starting at
address x. Records are represented using a generalized form of the points-to assertion
with an offset: for example, the heaplet $(x, 1) \mapsto l$ describes a memory location at the
address x + 1. The block assertion $[x, 3]$ is an artifact of C-style memory management:
it represents a memory block of three elements at address x that has been dynamically
allocated by malloc (and hence can be de-allocated by free). The first field of the
record stores the payload v, while the other two store the addresses l and r of the left
and right subtrees, respectively. The two disjoint heaps $\mathsf{tree}(l, S_l)$ and $\mathsf{tree}(r, S_r)$ store
the two subtrees. The pure part of the second clause indicates that the payload of the
whole tree consists of v and the subtree payloads, $S_l$ and $S_r$.
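The two clauses of the tree predicate translate directly into a recursive check on a concrete heap. The sketch below is an illustration under stated assumptions: the heap is a Python dictionary from addresses to values, the block assertion is not modelled (there is no notion of allocation), and the function name tree and its footprint result are choices made here, not part of SUSLIK.

```python
def tree(heap, x, visiting=frozenset()):
    """Return (S, footprint) if the heap contains a tree rooted at x as in (3).

    S is the set of payloads; footprint is the set of addresses the tree occupies.
    Raises ValueError if the sub-heaps overlap or the structure is cyclic, and
    KeyError if a required cell is missing from the heap.
    """
    if x == 0:                       # first clause: empty tree, empty heap
        return set(), set()
    if x in visiting:
        raise ValueError("cyclic structure")
    v, l, r = heap[x], heap[x + 1], heap[x + 2]
    s_l, fp_l = tree(heap, l, visiting | {x})
    s_r, fp_r = tree(heap, r, visiting | {x})
    node = {x, x + 1, x + 2}
    # Separating conjunction: the node and both subtrees must be disjoint.
    if node & fp_l or node & fp_r or fp_l & fp_r:
        raise ValueError("sub-heaps are not disjoint")
    return {v} | s_l | s_r, node | fp_l | fp_r

heap = {10: 1, 11: 20, 12: 30,
        20: 2, 21: 0, 22: 0,
        30: 3, 31: 0, 32: 0}
elements, footprint = tree(heap, 10)
```

Here elements is the payload set of the three-node tree and footprint covers all nine cells. A heap in which both child pointers of the root refer to the same node is rejected, because the two subtrees would share memory.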


2.2 The Basics of Deductive Synthesis 


The formal underpinning of SUSLIK is a deductive inference system called Synthetic
Separation Logic (SSL). Given a pre-/postcondition pair $\mathcal{P}$, $\mathcal{Q}$, deductive synthesis
proceeds by constructing a derivation of the SSL synthesis judgment, denoted
$\{\mathcal{P}\} \leadsto \{\mathcal{Q}\} \mid c$, for some program c. In this derivation, c is the output program,
constructed while searching for the proof of the synthesis goal $\{\mathcal{P}\} \leadsto \{\mathcal{Q}\}$. Intuitively,
the output program c should satisfy the Hoare triple $\{\mathcal{P}\}\ c\ \{\mathcal{Q}\}$. The derivation is
constructed by applying inference rules, a subset of which is presented in Fig. 2, and
every inference rule “emits” a program fragment corresponding to this deduction.
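\[
\textsc{Emp}\ \ \frac{\vdash \phi \Rightarrow \psi}{\{\phi; \mathsf{emp}\} \leadsto \{\psi; \mathsf{emp}\} \mid \texttt{skip}}
\qquad
\textsc{Frame}\ \ \frac{\{\phi; P\} \leadsto \{\psi; Q\} \mid c}{\{\phi; P * R\} \leadsto \{\psi; Q * R\} \mid c}
\]

\[
\textsc{Read}\ \ \frac{y \text{ is fresh} \qquad [y/a]\{\phi; (x, \iota) \mapsto a * P\} \leadsto [y/a]\{\mathcal{Q}\} \mid c}
{\{\phi; (x, \iota) \mapsto a * P\} \leadsto \{\mathcal{Q}\} \mid \texttt{let y = *(x + } \iota \texttt{); } c}
\]

\[
\textsc{Write}\ \ \frac{\mathit{Vars}(e) \subseteq \mathit{PV} \qquad e \neq e' \qquad \{\phi; (x, \iota) \mapsto e * P\} \leadsto \{\psi; (x, \iota) \mapsto e * Q\} \mid c}
{\{\phi; (x, \iota) \mapsto e' * P\} \leadsto \{\psi; (x, \iota) \mapsto e * Q\} \mid \texttt{*(x + } \iota \texttt{) = } e \texttt{; } c}
\]

Fig. 2. Selected SSL rules (simplified).

\[
\begin{array}{lll}
\textsc{Emp} & \{\mathsf{emp}\} \leadsto \{\mathsf{emp}\} & \mid\ \texttt{skip} \\
\textsc{Frame} & \{x \mapsto b1 * y \mapsto a1\} \leadsto \{x \mapsto b1 * y \mapsto a1\} & \mid\ \texttt{skip} \\
\textsc{Write} & \{x \mapsto b1 * y \mapsto b1\} \leadsto \{x \mapsto b1 * y \mapsto a1\} & \mid\ \texttt{*y = a1;} \\
\textsc{Write} & \{x \mapsto a1 * y \mapsto b1\} \leadsto \{x \mapsto b1 * y \mapsto a1\} & \mid\ \texttt{*x = b1; *y = a1;} \\
\textsc{Read} & \{x \mapsto a1 * y \mapsto b\} \leadsto \{x \mapsto b * y \mapsto a1\} & \mid\ \texttt{let b1 = *y; *x = b1; *y = a1;} \\
\textsc{Read} & \{x \mapsto a * y \mapsto b\} \leadsto \{x \mapsto b * y \mapsto a\} & \mid\ \texttt{let a1 = *x; let b1 = *y; *x = b1; *y = a1;}
\end{array}
\]

Fig. 3. Derivation of swap. Each row shows a goal, the rule applied to it, and the program emitted for it; the premise of each rule application is the row above.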


Figure 3 shows an SSL derivation for swap, using inference rules of Fig. 2. The
derivation, read bottom-up, starts with the pre/post pair from (2) as the synthesis
goal; each rule application simplifies the goal until both the pre- and the post-heap
are empty, and might also prepend a statement (highlighted in grey) to the output
program. In the initial goal, the READ rule can be applied to the heaplet $x \mapsto a$ to
read the logical variable a from location x into a fresh program variable a1; the second
application of READ similarly reads from the location y. At this point, the WRITE
rule is applicable to the post-heaplet $x \mapsto b1$ because its right-hand side only mentions
program variables and can be directly written into the location x; note that this rule
equalizes the corresponding heaplets in the pre- and post-condition. After two
applications of WRITE, the pre- and the post-heap become equal and can be simply cancelled
out by the FRAME rule, leaving emp on either side of the goal; the terminal rule EMP
then concludes the derivation. Although very simple, this example demonstrates the
secret behind SUSLIK’s efficiency: the shape of the specification restricts the set of
applicable rules and thereby guides program synthesis.
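The derivation of swap can be mimicked by a few lines of code that apply READ, WRITE, FRAME, and EMP in that fixed order. This is a deliberately simplified Python sketch, not the SUSLIK algorithm: it assumes goals contain only points-to heaplets at offset 0, represents a heap as a dictionary from a location name to a payload name, and performs no search or backtracking. The function names synthesize and fresh are hypothetical.

```python
def fresh(base, used):
    """Return a name of the form base1, base2, ... that is not in `used`."""
    i = 1
    while f"{base}{i}" in used:
        i += 1
    return f"{base}{i}"

def synthesize(pre, post, program_vars):
    """Derive straight-line code for a points-to-only goal {pre} ~> {post}."""
    pre, post, pv = dict(pre), dict(post), set(program_vars)
    program = []

    # READ: replace each logical payload in the precondition by a fresh program variable.
    for loc in list(pre):
        a = pre[loc]
        if a not in pv:
            y = fresh(a, pv | set(pre.values()) | set(post.values()))
            pre = {k: (y if e == a else e) for k, e in pre.items()}
            post = {k: (y if e == a else e) for k, e in post.items()}
            pv.add(y)
            program.append(f"let {y} = *{loc};")

    # WRITE: make each pre-heaplet agree with the post-heaplet at the same location.
    for loc, e in post.items():
        if loc in pre and pre[loc] != e:
            if e not in pv:
                raise ValueError(f"cannot write {e}: not a program variable")
            pre[loc] = e
            program.append(f"*{loc} = {e};")

    # FRAME: cancel heaplets that are identical on both sides.
    for loc in list(pre):
        if post.get(loc) == pre[loc]:
            del pre[loc], post[loc]

    # EMP: both heaps must now be empty; the rule itself emits skip.
    if pre or post:
        raise ValueError("goal not solved by READ/WRITE/FRAME/EMP alone")
    return program

swap_body = synthesize({"x": "a", "y": "b"}, {"x": "b", "y": "a"}, {"x", "y"})
```

For the goal (2), swap_body contains the same four statements as the program at the bottom of Fig. 3. A goal whose postcondition mentions a payload that never becomes a program variable is rejected, mirroring the side condition of WRITE.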


2.3 Synthesis with Recursion and Auxiliary Functions 


We now return to our introductory example—flattening a binary tree into a doubly- 
linked list—whose specification (1) we repeat here for convenience: 


{tree(x,S)} flatten(loc x) {dll(x, y, S)} 


The definition of the tree predicate has been shown above (3); the predicate dll(x, y, S) 
describes a doubly-linked list rooted at x with back-pointer y and payload set S: 


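\[
\begin{array}{rcl}
\mathsf{dll}(x, y, S) & \triangleq & x = 0 \Rightarrow \{S = \emptyset;\ \mathsf{emp}\} \\
 & \mid & x \neq 0 \Rightarrow \{S = \{v\} \cup S'; \\
 & & \quad [x, 3] * x \mapsto v * (x, 1) \mapsto n * (x, 2) \mapsto y * \mathsf{dll}(n, x, S')\}
\end{array} \tag{4}
\]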


Note that in the spec (1) both the set S and the back-pointer y are logical variables,
but S is implicitly universally quantified (a so-called ghost variable), because it occurs
in the precondition, while y is existentially quantified (a so-called existential variable),
because it only occurs in the postcondition. The reader might be wondering why use an
existential here instead of a null pointer: as we show below, such weakening is required
to obtain the solution in Fig. 1; we discuss the alternative spec and corresponding
solution in Sect. 4.

    flatten(loc x) {
      if (x == 0) {
      } else {
        let l = *(x + 1);   // tree(l, S_l)
        let r = *(x + 2);   // tree(r, S_r)
        flatten(l);         // dll(l, y_l, S_l)
        flatten(r);         // dll(r, y_r, S_r)
        ???                 // ~> dll(x, y, {v} ∪ S_l ∪ S_r)
      }
    }

Fig. 4. Intermediate synthesis state when deriving flatten.

At a high level, the synthesis of flatten proceeds by eagerly making recursive calls
on the left and the right sub-trees, l and r, as illustrated in Fig. 4, which leads to the
following synthesis goal:


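\[
\begin{array}{l}
\{[x, 3] * x \mapsto v * (x, 1) \mapsto l * (x, 2) \mapsto r * \mathsf{dll}(l, y_l, S_l) * \mathsf{dll}(r, y_r, S_r)\} \\
\quad \leadsto \{\mathsf{dll}(x, y, \{v\} \cup S_l \cup S_r)\}
\end{array} \tag{5}
\]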


Now the synthesizer must concatenate the two doubly-linked lists, rooted at l and r,
together with the parent node x into a single list. Since the spec gives us no access to the
last element of either of the two lists, this concatenation requires introducing a recursive
auxiliary function to traverse one of the lists to the end. We now demonstrate how
SUSLIK synthesizes recursive calls and discovers the auxiliary using a single mechanism
we call cyclic program synthesis [34], inspired by cyclic proofs in Separation Logic [11,76].
The main idea behind cyclic proofs is that, in addition to reaching a terminal rule
like EMP, a sub-goal can be “closed off” by forming a cycle to an identical companion
goal earlier in the derivation; in SSL these cycles give rise to recursive calls.

Figure 5 depicts a cyclic derivation of flatten. For now let us ignore the
applications of the PROC rule, which do not modify the synthesis goal; their purpose will
become clear shortly. Given the initial goal (1), SUSLIK first applies the OPEN rule,
which unfolds the definition of tree in the precondition and emits a conditional with
one branch per clause of the predicate. The first branch (x = 0) is trivially solved by
skip, since a null pointer is both an empty tree and an empty list. The second branch
is shown in Fig. 5: its precondition contains two predicate instances $\mathsf{tree}(l, S_l)$ and
$\mathsf{tree}(r, S_r)$ for the two sub-trees of x.

Now SUSLIK detects that either of those instances can be unified with the
precondition tree(x, S) of the top-level goal, so it fires the CALL rule, which uses cyclic
reasoning to synthesize recursive calls. More specifically, CALL has two premises: the
first one synthesizes a recursive call and the second one the rest of the program after
the call. The spec of the first premise must be identical to some earlier goal, so that
it can be closed off by forming a cycle; in our example, the back-link (1) connects the
first premise back to the top-level goal. Once a companion goal is identified, SUSLIK
inserts an application of PROC right above it: its purpose is to delineate procedure
boundaries, or, in other words, give a name to the piece of code that the CALL rule is
trying to call. To ensure that recursion is terminating, we must prove that $\mathsf{tree}(l, S_l)$
in the precondition of the CALL’s premise is strictly smaller than tree(x, S) in the
precondition of the companion (see [34] for more details about our termination checking
mechanism).

\[
\begin{array}{lll}
\vdots & & \text{(READ), (WRITE)} \\
\{l \mapsto \ldots * \mathsf{dll}(w, y_w, S_w) * \mathsf{dll}(r, y_r, S_r)\} \leadsto \{\mathsf{dll}(l, y_x, S')\} & \mid\ \texttt{helper(r, w, l)} & \text{back-link (3)} \\
\{x \mapsto \ldots * (l, 1) \mapsto w * \mathsf{dll}(w, y_w, S_w) * \mathsf{dll}(r, y_r, S_r)\} \leadsto \{\mathsf{dll}(x, y, S)\} & \mid\ \texttt{helper(r, w, l); ...} & \text{(CALL), goal (b)} \\
\{x \mapsto \ldots * \mathsf{dll}(l, y_l, S_l) * \mathsf{dll}(r, y_r, S_r)\} \leadsto \{\mathsf{dll}(x, y, S)\} & \mid\ \texttt{if (l = 0) \{...\} else \{...\}} & \text{(READ), (OPEN), (WRITE)} \\
\{x \mapsto \ldots * \mathsf{dll}(l, y_l, S_l) * \mathsf{dll}(r, y_r, S_r)\} \leadsto \{\mathsf{dll}(x, y, S)\} & \mid\ \texttt{helper(r, l, x)} & \text{(PROC), goal (a)} \\
\{\mathsf{tree}(l, S_l)\} \leadsto \{\mathsf{dll}(l, y_l, S_l)\} & \mid\ \texttt{flatten(l)} & \text{back-link (1)} \\
\{\ldots * \mathsf{tree}(l, S_l) * \mathsf{tree}(r, S_r)\} \leadsto \{\mathsf{dll}(x, y, S)\} & \mid\ \texttt{flatten(l); flatten(r); ...} & \text{(CALL)} \\
\{\mathsf{tree}(x, S)\} \leadsto \{\mathsf{dll}(x, y, S)\} & \mid\ \texttt{if (x = 0) \{ \} else \{...\}} & \text{(OPEN), (READ)} \\
\{\mathsf{tree}(x, S)\} \leadsto \{\mathsf{dll}(x, y, S)\} & \mid\ \texttt{flatten(x)} & \text{(PROC)}
\end{array}
\]

Fig. 5. Derivation of flatten and its recursive auxiliary helper.

After the second application of CALL (to $\mathsf{tree}(r, S_r)$), SUSLIK arrives at the goal (5),
with two lists in the precondition (marked (a) in Fig. 5). Ignoring again the application
of PROC, which will be inserted later, SUSLIK proceeds by unfolding the list $\mathsf{dll}(l, y_l, S_l)$
via OPEN, eventually arriving at the goal (b): this goal again has two lists in the
precondition but one of them is now smaller (it is the tail of $\mathsf{dll}(l, y_l, S_l)$). At this
point CALL detects that (a sub-heap of) goal (b) can be unified³ with goal (a), thus
forming the cycle (3), which this time links to an internal goal in the derivation instead
of the top-level goal. As before, SUSLIK inserts an application of the PROC rule just
above the companion goal (a), thereby abducing an auxiliary procedure with a fresh
name.


2.4 Implementation and Empirical Results 
The most up-to-date implementation of SUSLIK is publicly available at: 
https://github.com/TyGuS/suslik 


Table 1 collects the results of running SUSLIK on benchmarks from our prior work
[19,34,70,93] as well as seven new benchmarks, which we added to illustrate various
challenges discussed in subsequent sections.⁴ Most existing benchmarks had been adapted
from the literature on verification and synthesis [24,47,50,72]. In addition to standard
textbook data structures, our benchmarks include operations on two less common data
structures, which to the best of our knowledge cannot be handled by other synthesizers.


³ This is where we rely on the existential back-pointer in (1): if we replace y with 0,
then $\mathsf{dll}(l, 0, S_l)$ and $\mathsf{dll}(w, y_w, S_w)$ would not unify.
⁴ The code and benchmarks accompanying this paper are available online [35].




Table 1. SUSLIK benchmarks and results. We report the number of Procedures generated,
total number Stmt of statements in those procedures, the ratio Code/Spec of
code to specification (in AST nodes), and the synthesis time in seconds for standard
SUSLIK (Time), with a simpler cost function (TimeSC) and with no bounds on predicate
unfolding and calls (TimeNB). "-" denotes timeout after 30 minutes. Footnotes
indicate the sources of benchmarks.

Data structure     | Id | Description             | Proc | Stmt | Code/Spec | Time  | TimeSC | TimeNB
Integers           | 1  | Swap two                | 1    | 4    | 1.0x      | 0.2   | 1.2    | 0.2
                   | 2  | Min of two¹             | 1    | 3    | 1.1x      | 0.8   | 3.0    | 1.1
Singly linked list | 3  | Length?                 | 1    | 6    | 1.4x      | 0.4   | 0.5    | 0.6
                   | 4  | Max?                    | 1    | 11   | 1.9x      | 3.0   | 7.0    | 4.7
                   | 5  | Min?                    | 1    | 11   | 1.9x      | 2.9   | 6.7    | 4.1
                   | 6  | Singleton¹              | 1    | 4    | 0.9x      | 0.2   | 0.2    | 0.2
                   | 7  | Deallocate              | 1    | 4    | 5.5x      | 0.2   | 0.2    | 0.2
                   | 8  | Initialize              | 1    | 4    | 1.6x      | 0.4   | 0.4    | 0.6
                   | 9  | Copy?                   | 1    | 11   | 2.7x      | 0.6   | 1.0    | 393.3
                   | 10 | Append?                 | 1    | 6    | 1.1x      | 0.4   | 0.4    | 0.6
                   | 11 | Delete?                 | 1    | 12   | 2.6x      | 1.2   | 10.9   | 2.0
                   | 12 | Deallocate two          | 2    | 9    | 6.2x      | 0.2   | 0.2    | 0.2
                   | 13 | Append three            | 2    | 14   | 2.3x      | 1.0   | 2.5    | 1.7
                   | 14 | Non-destructive append  | 2    | 21   | 3.0x      | 8.0   | 51.5   | -
                   | 15 | Union                   | 2    | 23   | 5.5x      | 4.3   | 20.6   | 36.0
                   | 16 | Intersection⁴           | 3    | 32   | 7.0x      | 101.1 | 121.2  | -
                   | 17 | Difference⁴             | 2    | 21   | 5.1x      | 4.7   | 55.0   | 29.5
                   | 18 | Deduplicate⁴            | 2    | 22   | 7.3x      | 1.8   | 2.5    | 5.5
Sorted list        | 19 | Prepend?                | 1    | 4    | 0.4x      | 0.2   | 0.3    | 0.3
                   | 20 | Insert?                 | 1    | 19   | 3.1x      | 1.0   | 16.2   | 1.2
                   | 21 | Insertion sort?         | 1    | 7    | 1.2x      | 0.7   | 2.7    | 42.7
                   | 22 | Sort⁴                   | 2    | 13   | 4.9x      | 1.0   | 1.5    | 2.9
                   | 23 | Reverse⁴                | 2    | 11   | 4.0x      | 0.7   | 0.7    | 1.4
                   | 24 | Merge?                  | 2    | 30   | 4.4x      | 55.6  | 10.1   | -
Doubly linked list | 25 | Singleton?              | 1    | 5    | 1.1x      | 0.2   | 0.2    | 0.5
                   | 26 | Copy                    | 1    | 22   | 4.3x      | 7.2   | 9.9    | -
                   | 27 | Append?                 | 1    | 10   | 1.6x      | 1.7   | 27.2   | -
                   | 28 | Delete?                 | 1    | 19   | 3.7x      | 3.4   | 3.5    | -
                   | 29 | Single to double        | 1    | 23   | 6.0x      | 0.7   | 0.8    | 4.6
List of lists      | 30 | Deallocate              | 2    | 11   | 10.7x     | 0.2   | 0.3    | 0.3
                   | 31 | Flatten                 | 2    | 17   | 4.4x      | 0.6   | 0.6    | 1.9
                   | 32 | Length                  |      | 21   | 5.5x      | 22.8  | -      | -
Binary tree        | 33 | Size                    | 1    | 9    | 2.5x      | 0.4   | 0.6    | 185.8
                   | 34 | Deallocate              | 1    | 6    | 8.0x      | 0.2   | 0.2    | 0.2
                   | 35 | Deallocate two          | 1    | 16   | 11.8x     | 0.4   | 0.5    | 0.5
                   | 36 | Copy                    | 1    | 16   | 3.8x      | 2.5   | 42.9   | -
                   | 37 | Flatten w/append        | 1    | 17   | 4.8x      | 0.4   | 0.6    | 0.7
                   | 38 | Flatten w/acc           | 1    | 12   | 2.1x      | 0.6   | 0.9    | 1.9
                   | 39 | Flatten                 | 2    | 23   | 7.1x      | 1.5   | 1.0    | 5.5
                   | 40 | Flatten to dll in place | 2    | 15   | 9.6x      | 11.3  | -      | 23.2
                   | 41 | Flatten to dll w/null⁵  | 2    | 17   | 11.2x     | 106.1 | 1418.3 | 46.5
BST                | 42 | Insert?                 | 1    | 19   | 2.8x      | 14.6  | 21.7   | 518.0
                   | 43 | Rotate left?            | 1    |      | 0.2x      | 6.2   | 7.0    | -
                   | 44 | Rotate right?           | 1    |      | 0.2x      | 4.9   | 5.6    | -
                   | 45 | Find min⁵               | 1    | 11   | 1.4x      | 66.3  | 80.2   | -
                   | 46 | Find max⁵               | 1    | 18   | 2.2x      | 58.0  | 80.8   | -
                   | 47 | Delete root?            | 1    | 18   | 1.3x      | 13.9  | -      | -
                   | 48 | From list⁴              | 2    | 27   | 5.7x      | 10.0  | 10.7   | -
                   | 49 | To sorted list⁴         | 3    | 32   | 7.7x      | 20.8  | 11.7   | -
Rose tree          | 50 | Deallocate              | 2    | 9    | 12.0x     | 0.2   | 0.3    | 0.2
                   | 51 | Flatten                 | 3    | 25   | 8.0x      | 11.0  | 6.3    | -
                   | 52 | Copy⁵                   | 2    | 32   | 7.9x      | -     | -      | -
Packed tree        | 53 | Pack                    | 1    | 16   | 1.6x      | -     | -      | -
                   | 54 | Unpack⁵                 | 1    | 23   | 2.9x      | 21.0  | -      | -

¹ Jennisys [47]  ² IMPSYNTH [72]  ³ Dryad [50]  ⁴ Eguchi et al. [24]  ⁵ New


A rose tree [51] is a variable-arity tree, where child nodes are stored in a linked list; it 
is described in SL by two mutually recursive predicates (rtree for the tree and children 
for the list of children), and our synthesized operations on rose trees are also mutu- 
ally recursive. A packed tree is a binary tree serialized into an array; it is interesting 
because operations on packed trees use non-trivial pointer arithmetic (we discuss them 
in Sect. 6). 

Apart from the size of each program (in statements), we also report the ratio
of code size to spec size (both in AST nodes) as a measure of synthesis utility. For
the majority of the benchmarks the generated code is larger than the specification,
sometimes significantly (up to 12×); the only exceptions are benchmarks with very
convoluted specs, such as BST rotations (benchmarks 43 and 44), or extremely simple
programs, such as swap from Fig. 3 (benchmark 1) and prepending an element to a sorted
list (benchmark 19).

A number of benchmarks generate more than one procedure: those programs require
recursive auxiliaries [34], such as our running example flatten from Fig. 1
(benchmark 40). It is worth mentioning that benchmarks 37 through 41 encode different
versions of flattening a binary tree into a singly or doubly-linked list: 37 and 38 are
simplified versions that do not require discovering auxiliaries because they contain
additional hints from the user (a library function for appending lists in 37 and an inductive
specification for flatten with a list accumulator in 38); 39 is similar to 40 but returns
a singly-linked list (and hence requires allocation). Finally, 41 is a version of 40 that
uses 0 instead of y as the back-pointer of the output list; this precludes SUSLIK from
generating an auxiliary for appending two lists, and instead it discovers a slightly more
complex, but linear-time solution, which we discuss in Sect. 4.

The missing synthesis times for some benchmarks indicate that they could not be
synthesized automatically within 30 min, but were possible to solve in an “interactive”
mode, where the search has been given hints on how to proceed in the case of multiple
choices. We elaborate on the possibility of generating those programs automatically
in subsequent sections. Apart from regular SUSLIK time we also report time for two
variations discussed in Sect. 3.


3 Proof Search 


Similarly to existing deductive program synthesizers [43], SUSLIK adopts best-first 
AND/OR search [54] to search for a program derivation. The search space is repre- 
sented as a tree with two types of nodes. An OR-node corresponds to a synthesis goal, 
whose children are alternative derivations, any of which is sufficient to solve the goal. 
An AND-node corresponds to a rule application, whose children are premises, all of 
which need to be solved in order to build a derivation. Each goal has a cost, which 
is meant to estimate how difficult it is to solve. The search works by maintaining a 
worklist of OR-nodes that are yet to be explored. In each iteration, the node with the 
least cost is dequeued and expanded by applying all rules enabled according to a proof 
strategy; the node’s children are then added back to the worklist. 
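The search loop described above can be sketched in a few dozen lines. This is an illustrative Python sketch, not SUSLIK's implementation: the classes OrNode and AndNode, the callback expand (which plays the role of "all rules enabled by the proof strategy" and returns pairs of a rule name and a list of premise goals), and the toy goals used in the example are all assumptions made here.

```python
import heapq
import itertools

class OrNode:
    """A synthesis goal; solved as soon as one alternative derivation is solved."""
    def __init__(self, goal, parent=None):
        self.goal, self.parent, self.solution = goal, parent, None

class AndNode:
    """A rule application; solved only when all of its premises are solved."""
    def __init__(self, rule, parent, premises):
        self.rule, self.parent = rule, parent
        self.children = [OrNode(g, self) for g in premises]

def propagate(and_node):
    """Push solutions upwards for as long as complete derivations exist."""
    while and_node is not None:
        if any(c.solution is None for c in and_node.children):
            return
        goal_node = and_node.parent
        if goal_node.solution is not None:
            return
        goal_node.solution = (and_node.rule, [c.solution for c in and_node.children])
        and_node = goal_node.parent

def search(root_goal, expand, cost):
    """Best-first AND/OR search; returns a derivation tree or None."""
    root = OrNode(root_goal)
    tie = itertools.count()              # tie-breaker so nodes are never compared
    worklist = [(cost(root_goal), next(tie), root)]
    while worklist:
        _, _, node = heapq.heappop(worklist)     # OR-node with the least cost
        for rule, premises in expand(node.goal):
            alt = AndNode(rule, node, premises)
            if not premises:                     # terminal rule, e.g. EMP
                propagate(alt)
                if root.solution is not None:
                    return root.solution
            for child in alt.children:
                heapq.heappush(worklist, (cost(child.goal), next(tie), child))
    return None

# Toy goals: a natural number n is solved by reaching 0.
def expand(n):
    if n == 0:
        return [("zero", [])]
    alternatives = [("pred", [n - 1])]
    if n >= 2:
        alternatives.append(("split", [n // 2, n - n // 2]))
    return alternatives

derivation = search(3, expand, cost=lambda n: n)
```

A derivation is returned as a nested pair of a rule name and the derivations of its premises. The sketch omits refinements one would want in practice, such as discarding worklist entries whose ancestors are already solved.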

The proof strategy and the cost function are crucial to the performance of the proof
search. In the current SUSLIK implementation both are ad-hoc and brittle; in the rest of
the section we outline possible improvements to their design.


3.1 Pruning via Proof Strategies 


A proof strategy is a function that takes in a synthesis goal and its ancestors in the 
search tree, and returns a list of rules enabled to expand that goal. Without strategies, 
the branching factor of the search would be impractically large. SUSLIK’s strategies 
are based on the observation that some orders of rule applications are redundant, and 
hence can be eliminated from consideration without loss of completeness. Identifying 
redundant orders is non-trivial and is currently done informally, increasing the risk of 
introducing incomplete strategies. 

For example, SUSLIK’s proof strategy precludes applying CALL if CLOSE (a rule 
that unfolds a predicate in the postcondition) has been applied earlier in the deriva- 
tion. The reasoning is that CALL only operates on the precondition, while CLOSE only 
operates on the postcondition, hence the two rule applications must be independent, 
and can always be reordered so that CALL is applied first. But it gets more complicated 
once we let CALL abduce auxiliaries: now applying CALL after CLOSE could be useful 
to give it access to more companion goals, whose postconditions differ from that of the 
top-level goal. Consider for example copying a rose tree with the following spec: 


$$\{r \mapsto x * \mathsf{rtree}(x, S)\}\ \texttt{void rtcopy(loc r)}\ \{r \mapsto y * \mathsf{rtree}(y, S) * \mathsf{rtree}(x, S)\} \tag{6}$$
Copying a rose tree seems to require two mutually-recursive procedures: the main one
(6) that copies an rtree and an auxiliary one that copies the list of its children, and hence 
has children instead of rtree in its postcondition. To our surprise, however, our proof 
strategy does not preclude the derivation of rtcopy (see benchmark 52 in Table 1): in 
this derivation, the auxiliary returns two rtrees, which are then unfolded after the call 
to extract the relevant children. 
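To make the notion of a proof strategy concrete, the CALL/CLOSE restriction discussed above can be pictured as a filter over rule names. This is only a sketch under simplifying assumptions: the function name `strategy` is hypothetical, the rule list contains only rules named in this paper, and ancestors are represented by the names of the rules applied on the path from the root rather than by full goals.

```python
# Rule names taken from the text; the list is illustrative, not exhaustive.
ALL_RULES = ["READ", "WRITE", "FRAME", "CLOSE", "CALL", "EMP"]


def strategy(goal, ancestor_rules):
    """Return the rules enabled for `goal`.

    `ancestor_rules` lists the rules applied on the path from the root
    to this goal (an assumed, simplified view of the goal's ancestors).
    """
    enabled = list(ALL_RULES)
    if "CLOSE" in ancestor_rules:
        # Restriction from the text: no CALL once CLOSE has been applied.
        enabled.remove("CALL")
    return enabled
```

A filter of this shape is cheap to evaluate, but, as the rose tree example shows, arguing that it preserves completeness is the hard part.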


Future Directions. To develop more principled yet efficient strategies, we need to turn 
to the proof theory community, which has accumulated a rich body of work on efficient 
proof search. One technique of particular interest, focusing [53], defines a canonical
representation of proofs in linear logic [29] (more precisely, a canonical ordering on 
the application of proof rules, which can be enforced during the search by tracing 
local properties). Existing program synthesis work [27,79] has leveraged ideas from 
focusing, but only in the setting of type inhabitation for pure lambda calculi. SUSLIK 
takes advantage of some of these ideas, too: it designates some rules, such as READ 
and logical normalization rules, to be invertible; these rules can be applied eagerly and 
need not be backtracked. Beyond focusing, we might explore the applicability of more 
advanced canonical representations of programs and proofs [1,33,79]. We believe that 
these techniques will help us formalize and leverage inherent SSL symmetries, such 
as that two programs operating on disjoint parts of the heap can be executed in any 
order. 


3.2 Prioritization via a Cost Function 


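When selecting the next goal to expand, SUSLIK’s best-first search relies on a heuristic
cost function of the form (with $p, w \geq 1$):


$$\begin{aligned}
\mathrm{cost}(\{\phi; P\} \leadsto \{\psi; Q\}) &= p \cdot \mathrm{cost}(P) + \mathrm{cost}(Q) & \mathrm{cost}(p(\bar{e})^{u,c}) &= w \cdot (1 + u + c) \\
\mathrm{cost}(P * Q) &= \mathrm{cost}(P) + \mathrm{cost}(Q) & \mathrm{cost}(\_) &= 1
\end{aligned}$$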


In other words, the cost of a synthesis goal is the (weighted) total cost of all heaplets in
its pre- and postcondition. The intuition is that the synthesizer needs to eliminate all 
these heaplets in order to apply the terminal Emp rule, so each heaplet contributes to 
the goal being “harder to solve”. Predicates are more expensive than other heaplets, 
because they can be unfolded and produce more heaplets. In addition, for each predicate 
instance $p(\bar{e})^{u,c}$ SUSLIK keeps track of the number of times it has been unfolded (u)
or has gone through a call (c); factoring this into the cost prevents the search from 
getting stuck in an infinite chain of unfolding or calls. Finally, it can be useful to give 
a higher weight to the heaplets in the precondition, because many rules that create 
expensive search branches (most notably CALL) operate on the precondition. 
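The cost function above is easy to state in code. The sketch below is illustrative only: the classes `Heaplet` and `Pred` and the function names are hypothetical, and the default weights follow the values p = 3 and w = 2 reported in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heaplet:
    """A points-to or block heaplet; its content does not affect the cost."""
    text: str


@dataclass(frozen=True)
class Pred:
    """A predicate instance annotated with its unfolding and call counts."""
    name: str
    unfoldings: int = 0  # u
    calls: int = 0       # c


def heaplet_cost(h, w=2):
    if isinstance(h, Pred):
        return w * (1 + h.unfoldings + h.calls)
    return 1


def heap_cost(heap, w=2):
    # cost(P * Q) = cost(P) + cost(Q)
    return sum(heaplet_cost(h, w) for h in heap)


def goal_cost(pre, post, p=3, w=2):
    # cost({phi; P} ~> {psi; Q}) = p * cost(P) + cost(Q)
    return p * heap_cost(pre, w) + heap_cost(post, w)


# Spatial parts of spec (6), with heaplets written informally as strings.
PRE = [Heaplet("r -> x"), Pred("rtree")]
POST = [Heaplet("r -> y"), Pred("rtree"), Pred("rtree")]
```

With these definitions `goal_cost(PRE, POST)` is 3 * 3 + 5 = 14, dropping the precondition weight (`p=1`) gives 8, and a goal whose precondition predicate has already been unfolded once costs 20, so it is explored later than its sibling with fresh predicates.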

Our implementation currently uses p = 3, w = 2, which is a result of manual
tuning. Column TimeSC in Table 1 shows how synthesis times change if we set p = 1. 
As you can see, SUSLIK’s performance is quite sensitive even to this small change: four 
benchmarks, which originally took under 30s, now time out after 30 minutes, while 
benchmark 24, on the contrary, is solved five times faster. These results suggest that 
different synthesis tasks benefit from different search parameters, and that we might 
need a mechanism to tune SUSLIK’s search strategy for a given synthesis task.

In addition, because the cost heuristic is not efficient enough at guiding the search, 
we introduce hard bounds on the number of unfoldings and calls u and c for a predicate 
instance. Column TimeNB in Table 1 shows the results of running SUSLIK without
these bounds: as you can see, 19 benchmarks time out (compared to only two in the
original setup). The requirement to guess sufficient bounds for each benchmark hampers 
the usability of SUSLIK, hence in the future we would like to replace them with a better 
cost function. 


Future Directions. To guide the search in a more intelligent and flexible way, we turn 
to extensive recent work on using learned models to guide proof search [8, 28, 49, 78, 95] 
and program synthesis [5,15,39,46,55,82]. Guiding deductive synthesis would most 
likely require a non-trivial combination of these two lines of work. 

In the area of proof search, existing techniques are used to select the next strategy in 
a proof assistant script [59,60, 78,95], or select a subset of clauses to use in a first-order 
resolution proof [9,49]. Although these techniques are not directly applicable to our
context, we can likely borrow some high-level insights, such as two-phased search [49], 
which applies a slow neural heuristic to make important decisions in early stages of 
search (e.g., which predicate instances to unfold), and then less accurate but much 
faster hand-coded heuristics take over. Among the many techniques for guiding program 
synthesis, neural-guided deductive search (NGDS) [39] might be the natural place to 
start, since it shows how to condition the next synthesis step on the current synthesis 
sub-goal. 

At the same time we also expect the limited size of the available dataset (i.e., the
benchmarks from Table 1) would hamper the application of deep learning to SUSLIK. 
An alternative approach is to encode feature extractors [58] and apply machine learning 
algorithms to the result of such feature extractors. Another approach is to learn a 
coarse-grained model from available data and then adjust it during search, based on 
the feedback from incomplete derivations, as in [6,15,82]. 


4 Completeness 


Soundness and completeness are desirable properties of synthesis algorithms. In our 
case, it is natural to formalize these properties relative to an underlying verification 
logic, which defines Hoare triples {P} c {Q}, with the total correctness interpretation 
“starting from a state satisfying P, program c will execute without memory errors and 
terminate in a state satisfying Q”. This logic can be defined in the style of SMALL- 
FOOT [7], using a combination of symbolic execution rules and logical rules, with the 
addition of cyclic proofs to handle recursion [76]. 

Relative soundness means that any solution SUSLIK finds can be verified:
$\forall P, Q, c.\ P \leadsto Q \mid c \implies \{P\}\ c\ \{Q\}$. Relative completeness means that whenever
there exists a verifiable program, SUSLIK can find one: $\forall P, Q.\ (\exists c.\ \{P\}\ c\ \{Q\}) \implies
(\exists c'.\ P \leadsto Q \mid c')$. Proving relative soundness is rather straightforward, because SSL rules
are essentially more restrictive versions of verification rules, hence an SSL derivation
can be rewritten by translating every $P \leadsto Q \mid c$ into $\{P\}\ c\ \{Q\}$.⁵ Completeness on the
other hand is quite tricky, exactly because SSL rules impose more restrictions on the 
pre- and postconditions, in order to avoid blind enumeration of programs and instead 
guide synthesis by the spec. In the rest of this section we look into two major sources 
of relative incompleteness of SSL: recursive auxiliaries and pure reasoning. 


5 In our recent work we have developed an automatic translation from SSL derivations 
into three Coq-based verification logics [93]. 
4.1 Recursive Auxiliaries


A common assumption and source of incompleteness in recursive program synthe- 
sis [43,67,69] is that (1) synthesis is performed one function f at a time: if auxiliaries 
are required, their specifications are supplied explicitly; and (2) the specification Φ of f
is inductive: one can prove that Φ holds of f’s body assuming it holds of each recursive
call. This restriction hampers the usability of synthesizers, because the user must guess
all required auxiliaries and possibly generalize Φ to make it inductive, which in most
cases requires knowing the implementation of f. As we have shown in Sect. 2, SUSLIK 
mitigates these limitations to some extent, as it is able to discover auxiliary functions, 
such as helper in Fig. 1, automatically. To make the search tractable, however, cyclic 
synthesis restricts the space of auxiliary specifications considered by SUSLIK to syn- 
thesis goals observed earlier in the derivation. Although this restriction is easy to state, 
we still do not have a formal characterization (or even a firm intuitive understanding) 
of the class of auxiliaries that SSL fundamentally can and cannot derive. Below we 
illustrate the intricacies on a series of examples. 


 1 intersect (loc r, y)
 2 {
 3   let x = *r;
 4   if (x == 0) {
 5   } else {
 6     let v = *x;
 7     let n = *(x + 1);
 8     *r = n;
 9     intersect(r, y);
10     insert(v, x, r, y);
11   }
12 }
13 insert(int v, loc x, r, y) {
14   let z = *r;
15   if (y == 0) { free(x); }
16   else {
17     let vy = *y;
18     let n = *(y + 1);
19     if (v == vy) {
20       *(x + 1) = z;
21       *r = x;
22     } else {
23       insert(v, x, r, n);
24 }}}


Fig. 6. Intersection of lists with unique elements. This implementation cannot be syn- 
thesized from (7), but a slight modification of it can, as explained in the text. 


Generalizing Pure Specs. One reason SUSLIK might fail to abduce an auxiliary is 
that the pure part of the companion’s goal might be too specific for the recursive call. 
Let us illustrate this phenomenon using the list intersection problem (benchmark 16 
in Table 1) with the following specification, where ulist denotes a singly-linked list with 
unique elements: 


$$\{r \mapsto x * \mathsf{ulist}(x, S_x) * \mathsf{ulist}(y, S_y)\} \leadsto \{r \mapsto z * \mathsf{ulist}(z, S_x \cap S_y) * \mathsf{ulist}(y, S_y)\} \tag{7}$$


Given this specification, we expected SUSLIK to generate the program shown in Fig. 6. 
To compute the intersection of two input lists rooted in x and y, this program first 
computes the intersection of y and the tail of x (line 9). The auxiliary insert then 
traverses y to check if it contains v (the head of x), and if so, inserts it into the 
intermediate result z (line 23), and otherwise, de-allocates the node x (line 15). This 
program, however, cannot be derived by SSL; to see why let us take a closer look at 
the synthesis goal after line 9, which serves as the spec for insert: 
$$\{S_x = \{v\} \cup S_1 \wedge v \notin S_1 \wedge S_z = S_1 \cap S_y;\ r \mapsto z * \mathsf{ulist}(z, S_z) * \mathsf{ulist}(y, S_y) * x \mapsto v * \ldots\} \leadsto \{S_{z'} = S_x \cap S_y;\ r \mapsto z' * \mathsf{ulist}(z', S_{z'}) * \mathsf{ulist}(y, S_y)\} \tag{8}$$


The issue here is that the pure spec is too specific: the precondition $S_z = S_1 \cap S_y$
and the postcondition $S_{z'} = S_x \cap S_y$ define the behavior of this function in terms of the
elements of input lists x and y, but the recursive call in line 23 replaces y with its tail
n, so these specifications do not hold anymore. The solution is to generalize the pure
part of spec (8), so that it does not refer to $S_x$:


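$$\{v \notin S_z;\ r \mapsto z * \mathsf{ulist}(z, S_z) * \mathsf{ulist}(y, S_y) * x \mapsto v * \ldots\} \leadsto \{S_{z'} = S_z \cup (\{v\} \cap S_y);\ r \mapsto z' * \mathsf{ulist}(z', S_{z'}) * \mathsf{ulist}(y, S_y)\} \tag{9}$$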


Alas, such a transformation of the pure spec is beyond SUSLIK’s capabilities. 
To our surprise, SUSLIK was nevertheless able to generate a solution to this problem by
finding an alternative implementation for insert, shown below. This implementation
appends v to z instead of prepending it; more specifically, insert starts by traversing z,
and once it reaches the base case, it calls another auxiliary, intersectOne (omitted for
brevity), which traverses y and returns a list whose elements are $\{v\} \cap S_y$ (i.e., a list
with at most one element), which is then appended to the intersection. At first glance
it is unclear how this superfluous traversal of z can possibly help with generalizing the
spec (8); the key to this mystery lies in the recursive call in line 22: note that as the
second parameter, instead of the input list x, it actually uses z after replacing its head
element with v! This substitution makes the overly restrictive spec of (8) actually hold.

13 insert(int v, loc x, r, y) {
14   let z = *r;
15   if (z == 0) {
16     intersectOne(v, x, r, y);
17   } else {
18     let vz = *z;
19     let n = *(z + 1);
20     *r = n;
21     *z = v;
22     insert(v, z, r, y);
23-24 [remainder of listing not recoverable]

Of course this implementation is overly convoluted and inefficient, so in the future 
we plan to equip SUSLIK with the capability to generalize pure specs. To this end,
we plan to combine deductive synthesis with invariant inference techniques via
bi-abduction [86]. For instance, whenever the CALL rule identifies a companion goal, we
can replace its pure pre- and postcondition $\phi$ and $\psi$ with unknown predicates $U_\phi$
and $U_\psi$. During synthesis, we would maintain a set of Constrained Horn Clauses over
these unknown predicates (starting with $\phi \Rightarrow U_\phi$ and $U_\psi \Rightarrow \psi$); these constraints
can be solved incrementally, like in our prior work [69], pruning the current derivation
whenever the constraints have no solution. If synthesis succeeds, the assignment to $U_\phi$
and $U_\psi$ corresponds to the inductive generalization of the original auxiliary spec. Since
only the pure part of the spec is generalized, the spatial part can still be used to guide 
synthesis. 
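The difference between specs (8) and (9) can be checked on a concrete instance. The sketch below models only the pure parts of the two specs with Python sets and mirrors the insert procedure of Fig. 6 on Python lists; the function names `pre8`, `pre9`, `post9`, and `insert` are hypothetical, and the spatial parts of the specs are ignored.

```python
def pre8(v, s1, sz, sy):
    """Pure precondition of (8), with S_x = {v} ∪ S_1 left implicit."""
    return v not in s1 and sz == (s1 & sy)


def pre9(v, sz, sy):
    """Pure precondition of the generalized spec (9)."""
    return v not in sz


def post9(sz_new, v, sz, sy):
    """Pure postcondition of (9): S_z' = S_z ∪ ({v} ∩ S_y)."""
    return sz_new == sz | ({v} & sy)


def insert(v, z, y):
    """Fig. 6's insert on Python lists (z: intermediate result, y: second list)."""
    if not y:
        return z                 # line 15: v does not occur in y
    if v == y[0]:
        return [v] + z           # lines 20-21: prepend v to the result
    return insert(v, z, y[1:])   # line 23: recurse on the tail of y


# Instance: v = 1, S_1 = {2}, y = [2, 3], so the intermediate result is z = [2].
v, s1, z, y = 1, {2}, [2], [2, 3]
holds_at_entry = pre8(v, s1, set(z), set(y))            # S_z = S_1 ∩ S_y
holds_at_recursive_call = pre8(v, s1, set(z), set(y[1:]))  # y replaced by its tail
```

Here `holds_at_entry` is true but `holds_at_recursive_call` is false, because dropping the head of y changes $S_1 \cap S_y$ while $S_z$ stays the same. The precondition of (9) holds at both points, and the result of `insert` satisfies `post9`.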


Accumulator Parameters. It is common practice to introduce an auxiliary recursive 
function to thread through additional data in the form of “accumulator” inputs or 
outputs. Cyclic program synthesis has trouble conjuring up arbitrary accumulators, 
since it constructs auxiliary specifications from the original specification via unfolding 
and making recursive calls. 
Consider linked list reversal (benchmark 23 in Table 1): SUSLIK generates an inefficient,
quadratic version of this program, which reverses the tail of the list and then appends 
its head to the result (hence discovering “append element” as the auxiliary). The canon- 
ical linear-time version of reversal requires an auxiliary with two list arguments—the 
already reversed portion and the portion yet to be reversed—and hence is outside of 
SUSLIK’s search space: cyclic synthesis cannot encounter a precondition with two lists, 
as it starts with a single list predicate in the precondition, and neither unfolding nor 
making a call can duplicate it. 
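The two versions of reversal can be contrasted in a few lines. This is an illustrative sketch on immutable cons-lists, represented as nested pairs `(head, tail)` with `None` for the empty list, rather than on the heap-allocated lists SUSLIK manipulates; all function names are hypothetical.

```python
def append_element(xs, v):
    """The 'append element' auxiliary: walks the whole list to reach its end."""
    if xs is None:
        return (v, None)
    head, tail = xs
    return (head, append_element(tail, v))


def reverse_quadratic(xs):
    """Reverse the tail, then append the head: one full traversal per element."""
    if xs is None:
        return None
    head, tail = xs
    return append_element(reverse_quadratic(tail), head)


def reverse_linear(todo, done=None):
    """Accumulator version: `done` is the already reversed portion."""
    if todo is None:
        return done
    head, tail = todo
    return reverse_linear(tail, (head, done))


def from_list(values):
    xs = None
    for v in reversed(values):
        xs = (v, xs)
    return xs


def to_list(xs):
    out = []
    while xs is not None:
        out.append(xs[0])
        xs = xs[1]
    return out
```

The helper `reverse_linear` takes two lists, `todo` and `done`, which is exactly the shape of precondition that cyclic synthesis cannot reach when starting from a single list predicate.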


 1 flatten (loc x) {
 2   if (x == 0) {
 3   } else {
 4     let l = *(x + 1);
 5     let r = *(x + 2);
 6     flatten(l);
 7     helper(r, l, x);
 8   }
 9 }
10
11 helper (loc r, loc l, loc x) {
12   if (r == 0) {
13     if (l == 0) {} else {
14       *(l + 2) = x;
15     }
16   } else {
17     let rl = *(r + 1);
18     let rr = *(r + 2);
19     *(r + 2) = rl;
20     *(r + 1) = l;
21     helper(rl, l, r);
22     *(x + 2) = rr;
23     *(x + 1) = r;
24     helper(rr, r, x);
25   }
26 }


Fig. 7. Flattening a tree into a DLL in linear time. 


There are examples, however, where SUSLIK surprised us by inventing the necessary
accumulator parameters. Consider again our running example, flattening a tree into 
a doubly-linked list. Recall that given the spec (1), SUSLIK synthesizes an inefficient 
implementation with quadratic complexity. A canonical linear-time solution requires an 
auxiliary that takes as input a tree and a list accumulator, and simply prepends every 
traversed tree element to this list; because of the accumulator parameter, discovering 
this auxiliary seems to be outside the scope of cyclic synthesis. To our surprise, SUSLIK
is actually able to synthesize a linear-time version of flatten, shown in Fig. 7 (and
encoded as benchmark 41 in Table 1), given the following specification: 


{tree(x, S)} flatten (loc x) {dll(x,0,S)} (10) 


Compared with (1), the existential back-pointer y of the output list is replaced with 
the null-pointer 0, precluding SUSLIK from traversing the output of the recursive call 
(cf. Sect. 2), which in this case comes in handy, since it enforces that every tree element 
is traversed only once. 

The new solution starts the same way as the old one, by flattening the left sub-tree
l, which leads to the following synthesis goal after line 6:


$$\{\mathsf{dll}(l, 0, S_l) * \mathsf{tree}(r, S_r) * [x, 3] * x \mapsto v * \ldots\} \leadsto \{\mathsf{dll}(x, 0, \{v\} \cup S_l \cup S_r)\} \tag{11}$$


As you can see, the precondition now contains a tree and a list! Since it cannot
recurse on the list $\mathsf{dll}(l, 0, S_l)$, the synthesizer instead proceeds to unfold the tree
$\mathsf{tree}(r, S_r)$ and then use (11) as a companion for two recursive calls on r’s sub-trees,
turning (11) into a specification for helper in Fig. 7.
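To see the program of Fig. 7 at work, it can be transliterated into Python and run on a small heap. The sketch below makes several assumptions that are not stated in this section: the heap is modeled as a dictionary from addresses to values, a tree node at x stores its payload at x and its children at x + 1 and x + 2, and a list node stores its payload at x, its next pointer at x + 1, and its back-pointer at x + 2. The `calls` log and the helper `dll_values` are additions for illustration.

```python
def flatten(h, x, calls):
    calls.append("flatten")
    if x == 0:
        return
    l = h[x + 1]
    r = h[x + 2]
    flatten(h, l, calls)           # line 6: flatten the left sub-tree
    helper(h, r, l, x, calls)      # line 7


def helper(h, r, l, x, calls):
    calls.append("helper")
    if r == 0:
        if l != 0:
            h[l + 2] = x           # line 14: back-pointer of l is now x
    else:
        rl = h[r + 1]
        rr = h[r + 2]
        h[r + 2] = rl
        h[r + 1] = l
        helper(h, rl, l, r, calls)  # line 21
        h[x + 2] = rr
        h[x + 1] = r
        helper(h, rr, r, x, calls)  # line 24


def dll_values(h, x):
    """Walk the list from x, checking every back-pointer on the way."""
    values, prev = [], 0
    while x != 0:
        assert h[x + 2] == prev
        values.append(h[x])
        prev, x = x, h[x + 1]
    return values


def demo():
    # Tree: 1 at address 10 with children 2 (at 20) and 3 (at 30); 2 has left child 4 (at 40).
    h = {10: 1, 11: 20, 12: 30,
         20: 2, 21: 40, 22: 0,
         30: 3, 31: 0, 32: 0,
         40: 4, 41: 0, 42: 0}
    calls = []
    flatten(h, 10, calls)
    return dll_values(h, 10), len(calls)
```

On this four-node tree, `demo()` yields a well-formed list rooted at address 10 containing all four payloads, after nine procedure invocations: the top-level call plus two calls per tree node, in line with the claim that every tree element is traversed only once.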
4.2 Pure Reasoning


To enable synthesis of the wide range of programs demonstrated in Sect. 2, SUSLIK 
must support a sufficiently rich logic of pure formulas. Our implementation currently 
supports linear integer arithmetic and sets, but the general idea is to make SUSLIK 
parametric with respect to the pure logic (as long as it can be translated into an SMT-decidable
theory), and outsource all pure reasoning to an SMT solver. 

In the context of synthesis, however, outsourcing pure reasoning is trickier than 
it might seem (or at least trickier than in the context of verification). Consider the 
following seemingly trivial goal: 


$$\{x \mapsto a + 10\} \leadsto \{x \mapsto a + 11\} \tag{12}$$


This goal can be solved by incrementing the value stored in x, i.e., by the program
let a1 = *x; *x = a1 + 1. Verifying this program is completely straightforward: a
typical SL verifier would use symbolic execution to obtain the final symbolic state
$\{x \mapsto a + 10 + 1\}$, reducing verification to a trivial SMT query $\exists a.\ a + 10 + 1 \neq a + 11$.
Synthesizing this program, on the other hand, requires guessing the program expression
a1 + 1, which does not occur anywhere in the specification.

To avoid blind enumeration of program expressions, SUSLIK attempts to reduce 
the goal (12) to a syntax-guided synthesis (SyGuS) query [2]:


$$\exists f.\ \forall x, a, a_1.\ a_1 = a + 10 \implies f(x, a_1) = a + 11$$


Queries like this can be outsourced to numerous existing SyGuS solvers [3,32, 46,77]; 
SUSLIK uses CVC4 [74] for this purpose. Because SyGuS queries are expensive, the 
challenge is to design SSL rules to issue these queries sparingly. 
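To give a feel for what such a query asks, the following toy enumerator searches a tiny grammar for a suitable f. It is only a stand-in for a real SyGuS solver: the grammar (variables, the constants 0 to 2, and a single addition) is an assumption made for illustration, and candidates are checked on a finite grid of sample values instead of being proved correct for all inputs.

```python
import itertools

# Atomic terms of the assumed grammar, paired with their evaluators.
ATOMS = [("x", lambda x, a1: x), ("a1", lambda x, a1: a1)]
ATOMS += [(str(k), lambda x, a1, k=k: k) for k in range(3)]


def candidates():
    """Enumerate terms by size: atoms first, then sums of two atoms."""
    yield from ATOMS
    for (lt, lf), (rt, rf) in itertools.product(ATOMS, repeat=2):
        yield (f"{lt} + {rt}", lambda x, a1, lf=lf, rf=rf: lf(x, a1) + rf(x, a1))


def satisfies_spec(f, samples=range(-3, 4)):
    """Check a1 = a + 10 ==> f(x, a1) = a + 11 on sampled values of x and a."""
    for x, a in itertools.product(samples, repeat=2):
        a1 = a + 10
        if f(x, a1) != a + 11:
            return False
    return True


def synthesize():
    for text, f in candidates():
        if satisfies_spec(f):
            return text
    return None
```

With this enumeration order, `synthesize()` returns the term `a1 + 1`, the expression needed by the program above. Even in this toy setting every candidate is tested against all samples, which hints at why real SyGuS queries are expensive.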


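  ------------------------------------------------------------------ EMP
  {a1 = a + 10; emp} ⇝ {emp} | skip
  ------------------------------------------------------------------ FRAME
  {a1 = a + 10; x ↦ a1 + 1} ⇝ {x ↦ a1 + 1} | skip
  ------------------------------------------------------------------ WRITE
  {a1 = a + 10; x ↦ a1} ⇝ {x ↦ a1 + 1} | *x = a1 + 1
  ------------------------------------------------------------------ SOLVE-∃
  {a1 = a + 10; x ↦ a1} ⇝ {y = a + 11; x ↦ y} | *x = a1 + 1
  ------------------------------------------------------------------ ∃-INTRO
  {a1 = a + 10; x ↦ a1} ⇝ {x ↦ a + 11} | *x = a1 + 1
  ------------------------------------------------------------------ READ
  {x ↦ a + 10} ⇝ {x ↦ a + 11} | let a1 = *x; *x = a1 + 1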


Fig. 8. SSL derivation for goal (12). 


Figure 8 shows how two pure reasoning rules, ∃-INTRO and SOLVE-∃, work together
to solve the goal (12). ∃-INTRO is triggered by the postcondition heaplet $x \mapsto a + 11$,
whose right-hand side is a ghost expression, which blocks the application of WRITE.
∃-INTRO replaces the ghost expression with a program-level existential variable y (i.e.,
an existential which can only be instantiated with program expressions). Now SOLVE-∃
takes over: this rule constructs a SyGuS query using all existentials in the current goal
as unknown terms and the pure pre- and postcondition as the SyGuS specification.
In this case, the SyGuS query succeeds, replacing the existential y with the program
term a1 + 1. From here on, the regular WRITE rule finishes the job.
Note that although the goal (12) is artificially simplified, it is extracted from a
real problem: benchmark 32 in Table 1, length of a list of lists. In fact the versions of
SUSLIK reported in our previous work were incapable of solving this benchmark because
they were lacking the ∃-INTRO rule, which we only introduced recently. Although the
current combination of pure reasoning rules works well for all our benchmarks, it is
still incomplete (even modulo the completeness of the pure synthesizer), because, for
efficiency reasons, SOLVE-∃ only returns a single solution to the SyGuS problem, even
if the pure specification allows for many. This might be insufficient when SOLVE-∃ is
called before the complete pure postcondition is known (for example, to synthesize 
actual arguments for a call). Developing an approach to outsourcing pure reasoning 
that is both complete and efficient is an open challenge for future work. 


5 Quality of Synthesized Programs 


Should we hope that the output of deductive synthesis will be directly integrated into 
high-assurance software, we need to make sure that the code it generates is not only 
correct, but also efficient, concise, readable, and maintainable. The current implementa- 
tion of SUSLIk does not take any of these considerations into account during synthesis; 
in this section we discuss two of these challenges, and outline some directions towards 
addressing them. 


5.1 Performance 


We have already mentioned examples of SUSLIK solutions with sub-optimal asymptotic 
complexity in Sect. 4: for example, SUSLIK generates quadratic programs for linked list 
reversal and tree flattening instead of optimal linear-time versions. Although a linear-time
solution to tree flattening from Fig. 7 is actually within SUSLIK’s search space
(even with the more general spec (1)), SUSLIK opts for the sub-optimal one simply
because it has no ability to reason about performance and hence has no reason to 
prefer one over the other. 

To enable SUSLIK to pick the more efficient of the two implementations, we can 
integrate SSL with a resource logic, such as [56], following the recipe from our prior 
work on resource-guided synthesis [44]. One option is to annotate each points-to heaplet 
$x \overset{p}{\mapsto} e$ with non-negative potential p, which can be used to pay for execution of statements,
according to a user-defined cost model. Predicate definitions can describe how
potential is allocated inside the data structure; for example, we can define a tree with 
p units of potential per node as follows: 


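$$\begin{aligned}
\mathsf{tree}(x, S, p) \triangleq{} & x = 0 \Rightarrow \{S = \emptyset;\ \mathsf{emp}\} \\
\mid{} & x \neq 0 \Rightarrow \{S = \{v\} \cup S_l \cup S_r;\ [x, 3] * x \overset{p}{\mapsto} v * (x, 1) \mapsto l * (x, 2) \mapsto r * \mathsf{tree}(l, S_l, p) * \mathsf{tree}(r, S_r, p)\}
\end{aligned}$$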


We can now annotate the specification (1) with potentials as follows: 
{tree(x, S,2)} flatten (loc x) {dll(x,y, S,0)} (13) 


If we define the cost of a procedure call to be 1, and the cost of other statements to 
be 0, this specification guarantees that flatten only makes a number of recursive calls 
that is linear in the size of the tree (namely, two calls per tree element). With this 
specification, the inefficient solution in Fig. 1 does not verify: since helper traverses
the list r, it must assign some positive potential to every element of this list in order to 
pay for the call in line 24, but the specification (13) assigns no potential to the output 
list. On the other hand, the efficient solution in Fig. 7 verifies: after the recursive call to 
flatten in line 6 we obtain $\{\mathsf{dll}(l, y, S_l, 0) * \mathsf{tree}(r, S_r, 2) * \ldots\}$; helper verifies against
this specification since it only traverses the tree r and hence can use the two units of 
potential stored in its root to pay for the two calls in lines 21 and 24. In fact, the user 
need not guess the precise amount of potential p = 2 in the spec (13): any constant 
$p \geq 2$ would work to reject the quadratic solution and admit the linear one.


5.2 Readability 


Although readability is hard to quantify, we have noticed several patterns in
SUSLIK-generated code that are obviously unnatural to a human programmer, and hence
need to be addressed. Perhaps the most interesting problem arises due to inference 
of recursive auxiliaries: because SUSLIK has no notion of abstraction boundaries, the 
allocation of work between the different procedures is often sub-optimal. One exam- 
ple is benchmark 39 in Table 1, which flattens a binary tree into a singly-linked list. 
This example is discussed in detail in our prior work [34]; the solution is similar to 
flatten from Fig. 1, except that this transformation cannot be performed in-place:
instead, the original tree nodes have to be deallocated, and new list nodes have to
be allocated. Importantly, in SUSLIK’s solution, tree nodes are deallocated inside the
helper function, whose main purpose is to append two lists. A better design would 
be to perform deallocation in the main function, so that helper has no knowledge of 
tree nodes whatsoever. To address this issue in the future we might consider different 
quality metrics when abducing specs for auxiliaries, such as encouraging all heaplets 
generated by unfolding the same predicate to be processed by the same procedure. 


6 Applications 


6.1 Program Repair 


In our statement of the synthesis problem, complete programs are generated from 
scratch from Hoare-style specifications. But what if a program has already been
written but is buggy: would it be possible to automatically find a fix for it if we
know what its specification is? This line of research, employing deductive synthesis for 
automated program repair [30], known as deductive program repair, has been explored 
in the past for functional programs [42] and simple memory safety properties [90], and 
only recently has been extended to heap-manipulating programs using the approach 
pioneered by SUSLIK [63]. 

The SL-based deductive repair relies on existing automated deductive verifiers [17] 
to identify a buggy code fragment (which breaks the verification), followed by the 
discovery of the correct specification, which is used for the subsequent synthesis of 
the patch. The main shortcoming of the existing SL-based repair tools is the need to 
provide the top-level specs for the procedures in order to enable their verification (and 
potential bug discovery) in the first place. As a way to improve the utility of those 
tools, a promising direction is to employ existing static analyzers, such as INFER [12], 
to derive those specifications by abducing them from the usages of the corresponding 
functions [13]. 
6.2 Data Migration and Serialization


The pay-off of deductive synthesis is especially high for programs like tree flatten- 
ing, which change the internal representation of a data structure without changing its 
payload; these programs usually have a simple specification, while their implementa- 
tions can get much more intricate. One example where such programs can be useful 
is migration of persistent data: thanks to recent advancements in non-volatile memory 
(NVM) [40,45,84], large amounts of data are now persistently stored in memory, in 
arbitrary programmer-defined data structures. If the programmer decides to change the 
data structure, data has to be migrated between the old and the new representations, 
and writing those migration functions by hand can be tedious. In addition, reallocat- 
ing large data structures is often prohibitively expensive, so the migration needs to 
be performed in-place, without reallocation. As we have demonstrated in our running 
example, this is something that can be easily specified and synthesized in SUSLIK. 


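[Figure: pointer-based and packed representations of the same binary tree]

$$\begin{aligned}
\mathsf{ptree}(x, n, S) \triangleq{} & \{x \mapsto tag * \mathsf{ptree}'(x, tag, n, S)\} \\
\mathsf{ptree}'(x, tag, n, S) \triangleq{} & tag = 1 \Rightarrow \{n = 1 \wedge S = \{v\};\ (x, 1) \mapsto v\} \\
\mid{} & tag = 0 \Rightarrow \{n = 1 + n_l + n_r \wedge S = \{v\} \cup S_l \cup S_r;\ (x, 1) \mapsto v * \mathsf{ptree}(x + 2, n_l, S_l) * \mathsf{ptree}(x + 2 \cdot (1 + n_l), n_r, S_r)\}
\end{aligned}$$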


Fig. 9. (Left) Pointer-based and packed representations of the same binary tree. (Right) 
An SL predicate for packed trees. 


Another real-world application of this kind of programs is data serialization and
de-serialization, where data is transformed back and forth between a standard
pointer-based representation and an array so that it can be written to disk or sent over the
network [16,91]. For example, Fig. 9 shows a pointer-based full binary tree and its
serialized (or packed) representation, where the nodes are laid out sequentially in
pre-order [92]. The right-hand side of the figure shows an SL predicate ptree that describes
packed trees: every node x starts with a tag that indicates whether it is a leaf; if x is
not a leaf, its left child starts at the address x + 2 and its right child at $x + 2 \cdot (1 + n)$,
where n is the size of the left child, which is typically unknown at the level of the
program.

Imagine a programmer wants to synthesize functions that translate between these 
two representations, i.e., pack and unpack the tree. The most natural specification for 
unpack would be: 


$$\{r \mapsto x * \mathsf{packed}(x, sz, S)\}\ \texttt{unpack\_simple(loc r)}\ \{r \mapsto y * \mathsf{packed}(x, sz, S) * \mathsf{tree}(y, S)\} \tag{14}$$


This specification, however, cannot be implemented in SSL: when x is an internal node,
we do not know the address of its right subtree, so we have nothing to pass into the 
second recursive call. Instead unpack must traverse the packed tree and discover the 
address of the right subtree by moving past the end of the left subtree; this can be 
implemented by returning the address past the end of the ptree together with the root
of the newly built tree, as a record:


$$\{r \mapsto x * (r, 1) \mapsto \_ * \ldots\}\ \texttt{unpack(loc r)}\ \{r \mapsto x + 2 \cdot sz * (r, 1) \mapsto y * \ldots\} \tag{15}$$


With this specification, SUSLIK is able to synthesize unpack in 20s (benchmark 54 in 
Table 1); as for pack (benchmark 53), it is within the search space (which we confirmed 
in interactive mode) but automatic search currently times out after 30 minutes. In 
the future, it would be great if SUSLIK could automatically discover an auxiliary with 
specification (15), given only (14) as inputs; this is similar to the problem of discovering 
accumulator parameters, which we discussed in Sect. 4, and is outside the capabilities of
cyclic synthesis at the moment. 
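The idea behind specification (15), returning the address past the end of the packed tree together with the unpacked result, can be illustrated outside separation logic. The sketch below is an assumption-laden model and not SUSLIK output: trees are Python tuples, the packed form is a flat Python list with two cells per node (tag, then payload) laid out in pre-order, and the names `pack` and `unpack` are used only for illustration.

```python
def pack(tree):
    """Serialize a full binary tree in pre-order, two cells per node.

    A tree is ("leaf", v) or ("node", v, left, right).
    """
    if tree[0] == "leaf":
        return [1, tree[1]]
    _, v, left, right = tree
    return [0, v] + pack(left) + pack(right)


def unpack(cells, x=0):
    """Rebuild the tree starting at address x.

    Returns the tree together with the address just past its last cell,
    which is how the caller learns where the right subtree begins.
    """
    tag, v = cells[x], cells[x + 1]
    if tag == 1:
        return ("leaf", v), x + 2
    left, right_start = unpack(cells, x + 2)   # left child starts at x + 2
    right, end = unpack(cells, right_start)    # right child starts past the left one
    return ("node", v, left, right), end


EXAMPLE = ("node", 1, ("leaf", 2), ("node", 3, ("leaf", 4), ("leaf", 5)))
```

For `EXAMPLE`, the packed form is `[0, 1, 1, 2, 0, 3, 1, 4, 1, 5]`; the left child has size 1, so the right child starts at address 2 * (1 + 1) = 4, and `unpack` returns the original tree together with the end address 10.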


6.3 Fine-Grained Concurrency 


Finally, we envision that deductive logic-based synthesis will make it possible to tackle 
the challenge of synthesizing provably correct concurrent libraries. The most efficient 
shared-memory concurrent programs implement custom synchronization patterns via 
fine-grained primitives, such as compare-and-set (CAS). Due to sophisticated interfer- 
ence scenarios between threads, reasoning about such programs is particularly chal- 
lenging and error-prone, and is the reason for the existence of many extensions of 
Concurrent Separation Logic (CSL) [10,65] for verification of fine-grained concur- 
rency [22, 23, 36,38, 61,85, 87-89]. 

For instance, Fine-Grained Concurrent Separation Logic (FCSL) [61,80,81] takes
a very specific approach to fine-grained concurrency verification, following the tradition
of logics such as LRG [25] and CAP [22] and building on the idea of splitting the
specification of a concurrent library into a resource protocol and Hoare-style pre/postconditions.
State-of-the-art automated tools for fine-grained concurrency verification
require one to describe both the protocol and Hoare-style pre/postconditions for the
methods to be verified [21,94]. We believe it should be possible to take those two components
and instead synthesize the concurrent method implementations. The resource
protocol will provide an extended set of language primitives to compose programs from. 
Those data structure-specific primitives can be easily specified in FCSL and contribute 
derived inference rules describing when these primitives can be used safely. 
Abstract. Despite the large number of sophisticated deep neural network
(DNN) verification algorithms, DNN verifier developers, users, and
researchers still face several challenges. First, verifier developers must 
contend with the rapidly changing DNN field to support new DNN opera- 
tions and property types. Second, verifier users have the burden of select- 
ing a verifier input format to specify their problem. Due to the many 
input formats, this decision can greatly restrict the verifiers that a user 
may run. Finally, researchers face difficulties in re-using benchmarks to 
evaluate and compare verifiers, due to the large number of input formats 
required to run different verifiers. Existing benchmarks are rarely in for- 
mats supported by verifiers other than the one for which the benchmark 
was introduced. In this work we present DNNV, a framework for reducing 
the burden on DNN verifier researchers, developers, and users. DNNV 
standardizes input and output formats, includes a simple yet expressive 
DSL for specifying DNN properties, and provides powerful simplification 
and reduction operations to facilitate the application, development, and 
comparison of DNN verifiers. We show how DNNV increases the support 
of verifiers for existing benchmarks from 30% to 74%. 


Keywords: Deep neural networks · Formal verification · Tool


1 Introduction 


Deep neural networks (DNNs) are being applied increasingly in complex domains
including safety-critical systems such as autonomous driving [3,7]. For such
applications, it is often necessary to obtain behavioral guarantees about the safety
of the system. To address this need, researchers have been exploring algorithms
for verifying that the behavior of a trained DNN meets some correctness
property. In the past few years, more than 20 DNN verification algorithms have been
introduced [2,4,6,8-11,15,21,22,24-27,29-34,36], and this number continues to
grow. Unfortunately, this progress is hindered by several challenges.

First, DNN verifier developers must contend with a rapidly changing field
that continually incorporates new DNN operations and property types. While
supporting more properties and operations may increase the applicable scope
of verifiers to real-world problems, it also increases a verifier's complexity.
For example, for a verifier such as DeepPoly, supporting additional operations
requires non-trivial effort to define and prove correctness of new abstract
transformers. For verifiers such as Reluplex or Neurify, supporting new property types
requires implementing a mapping from those properties onto internal verifier
structures.


Table 1. The network and property formats supported by each verifier. A * indicates
that only a subset of the full input format specification is supported.

Verifier        | Network format             | Property format     | Algorithmic approach
Reluplex [16]   | Reluplex-NNET              | Hard-coded          | Search
Planet [10]     | RLV                        | RLV                 | Search
BaB [6]         | RLV                        | RLV                 | Search
BaBSB [6]       | RLV                        | RLV                 | Search
MIPVerify [29]  | MIPVerify Julia API        | MIPVerify Julia API | Optimization
Neurify [30]    | Neurify-NNET               | Hard-coded          | Search-optimization
DeepZono [25]   | ONNX*, ERAN-PYT, ERAN-TF   | ERAN Python API     | Reachability
DeepPoly [26]   | ONNX*, ERAN-PYT, ERAN-TF   | ERAN Python API     | Reachability
RefineZono [27] | ONNX*, ERAN-PYT, ERAN-TF   | ERAN Python API     | Reachability
RefinePoly [24] | ONNX*, ERAN-PYT, ERAN-TF   | ERAN Python API     | Reachability
Marabou [17]    | Reluplex-NNET or ONNX*     | Marabou Python API  | Search
nnenum [1]      | ONNX*                      | nnenum Python API   | Search-reachability
VeriNet [14]    | ONNX* or Neurify-NNET      | VeriNet Python API  | Search-optimization

Second, DNN verifier users carry the burden of re-writing property specifi- 
cations and transforming their models to match a chosen verifier’s supported 
format. That burden is compounded by the diversity of input formats required 
by each verifier, as illustrated in Table 1. There is little overlap between input 
formats for verifiers (only DeepZono and DeepPoly or BaB and BaBSB which 
are algorithmically similar), and even when using the same format (as in the 
case of the popular ONNX format) we find that the underlying operations sup- 
ported are different. This makes it difficult and costly to run multiple verifiers on 
a given problem since the user must understand the requirements of each verifier 
and translate inputs to their formats. While two new formats, VNNLIB [13] and 
SOCRATES [20], have been introduced in an attempt to standardize DNN veri- 
fier input formats, their expressiveness is currently limited and they can require 
writing new conversion tools for networks, as we discuss at the end of Sect. 3.1. 

Finally, DNN verifier researchers face challenges in re-using benchmarks to 
evaluate and compare verifiers. Most benchmarks exist in the format of the ver- 
ifier for which they were introduced, and running other verifiers on that bench- 
mark requires writing custom tooling to translate the benchmark to other for- 
mats, or writing new input parsers for verifiers to support the given benchmark 
format. For example, the ACAS Xu benchmark (described in Sect. 5) was originally
specified with networks in Reluplex-NNET format, and properties hard-coded
into the verifier. The benchmark was converted, for example, into RLV
format for BaB and BaBSB, as well as into ONNX with hard-coded properties
for RefineZono. Other benchmarks, such as the DAVE benchmark used by Neurify,
have networks specified in Neurify-NNET, and properties hard-coded into
the verifier. Due to its format, this potentially valuable benchmark has not been
used by other verifiers.

We introduce a framework, DNNV, to reduce the burden on verifier researchers, 
developers, and users. DNNV helps to create and run more re-usable verification 
benchmarks by standardizing a network and property format, and it increases the 
applicability of a verifier to richer properties and real-world benchmarks by per- 
forming property reductions and simplifying DNN structures. 


Fig. 1. DNNV architecture 


As shown in Fig. 1, DNNV takes as input a network in the common ONNX 
input format, a property written in an expressive domain-specific language 
DNNP, and the name of a target verifier. Using the framework and plugins for 
the target verifier, DNNV transforms the problem by simplifying the network 
and reducing the property to enable the application of verifiers that otherwise 
would be unable to run. DNNV then translates the network and property to 
the input format of the desired verifier, runs that verifier on the transformed 
problem, and returns the results in a standardized format. 

The primary contributions of this work are: (1) the DNNV framework to
reduce the burden on DNN verifier researchers, developers, and users; DNNV
includes a simple yet expressive DSL for specifying DNN properties, and powerful
simplification and reduction operations to increase verifiers' scope of
applicability, (2) an open source tool implementing DNNV¹, with support for 13
verifiers, and extensive documentation, and (3) an evaluation demonstrating the
cost-effectiveness of DNNV to increase the scope of applicability of verifiers.


¹ https://github.com/dlshriver/DNNV.


2 Background


A deep neural network $N$ encodes an approximation of a target function
$f : \mathbb{R}^n \to \mathbb{R}^m$. A DNN can be represented as a directed graph
$G_N = (V_N, E_N)$, where nodes, $v \in V_N$, represent operations and edges,
$e \in E_N$, represent input arguments to operations. A node without any incoming
edges is an input to the DNN. The output of a DNN can be computed by looping over
nodes in topological order and computing the value of the node given its inputs.
The literature on machine learning has developed a broad range of rich operation types and
explored the benefits of different combinations of operations in realizing accurate
approximations of different target functions, e.g., [12].
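
As a minimal sketch of this evaluation scheme, the following Python code computes the
output of a small operation graph by visiting its nodes in topological order. The graph
encoding (a dictionary from node names to an operation and the names of its inputs), the
toy network, and the helper name `evaluate` are illustrative assumptions and are not taken
from DNNV or from any verifier.

```python
from graphlib import TopologicalSorter


def evaluate(graph, inputs):
    """Evaluate an operation graph.

    graph:  dict mapping node name -> (function, list of input node names);
            names that never appear as keys are inputs to the DNN.
    inputs: dict mapping input node name -> value.
    """
    # TopologicalSorter expects a mapping from each node to its predecessors.
    sorter = TopologicalSorter({name: preds for name, (_, preds) in graph.items()})
    values = dict(inputs)
    for name in sorter.static_order():
        if name in values:  # input node, value already known
            continue
        fn, preds = graph[name]
        values[name] = fn(*(values[p] for p in preds))
    return values


# Toy scalar network: y = relu(2*x - 1) + x (a residual-style connection).
graph = {
    "affine": (lambda x: 2.0 * x - 1.0, ["x"]),
    "relu": (lambda a: max(a, 0.0), ["affine"]),
    "add": (lambda r, x: r + x, ["relu", "x"]),
}
result = evaluate(graph, {"x": 3.0})["add"]
```

Because every node is visited only after all of its inputs, the same loop handles
sequential networks and networks with residual connections alike.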

Given a DNN, $N : \mathbb{R}^n \to \mathbb{R}^m$, a property, $\phi(N)$, defines a set
of constraints over the inputs, $\phi_X$ (the pre-condition), and a set of constraints
over the outputs, $\phi_Y$ (the post-condition). Verification of $\phi(N)$ seeks to prove
or falsify: $\forall x \in \mathbb{R}^n : \phi_X(x) \rightarrow \phi_Y(N(x))$.

A widely studied class of properties is robustness, which originated with the 
study of adversarial examples [28,35]. These properties specify that inputs from 
a specific region of the input space must all produce the same output class. 
Detecting violations of robustness properties has been widely studied, and they 
are a common type of property for evaluating verifiers [10,25,26,29,30]. Another
common class of properties is reachability, which defines the post-condition using
constraints over output values. Reachability properties specify that inputs from 
a given region of the input space must produce outputs within a given region 
of the output space. Such properties have been used to evaluate several DNN 
verifiers [16,17,30]. 
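
To make the shape of a local robustness property concrete, the sketch below searches for
a violation by random sampling inside a hyper-rectangle around a given input. It is only a
falsifier: finding no violation proves nothing, which is exactly why verifiers are needed.
The function names, the toy two-class network, and the sampling strategy are assumptions
made for illustration.

```python
import random


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def find_robustness_violation(network, x, epsilon, samples=1000, seed=0):
    """Look for x' with |x'_i - x_i| <= epsilon that is classified differently from x.

    Returns a violating input, or None if the sampling found none.
    """
    rng = random.Random(seed)
    target = argmax(network(x))
    for _ in range(samples):
        candidate = [xi + rng.uniform(-epsilon, epsilon) for xi in x]
        if argmax(network(candidate)) != target:
            return candidate
    return None


# Toy two-class "network": class 0 if x0 >= x1, class 1 otherwise.
def toy_network(x):
    return [x[0] - x[1], x[1] - x[0]]


# Far from the decision boundary: no perturbation of radius 0.1 changes the class.
robust = find_robustness_violation(toy_network, [1.0, 0.0], epsilon=0.1)
# Close to the decision boundary: a violation exists and sampling finds one.
fragile = find_robustness_violation(toy_network, [0.51, 0.5], epsilon=0.1)
```

A reachability property has the same pre-condition shape; only the check on the output
changes, from comparing predicted classes to testing membership in an output region.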

A recent survey on DNN verification [18] classifies verification algorithms based
on their type: reachability, optimization, search, or a combination of these.
Reachability-based methods compute a representation of the reachable set of 
outputs from an encoding of the set of inputs that satisfy the pre-condition. The 
computed output set is often an over-approximation of the true reachable output 
region. The precision of the computed output region depends on the symbolic 
representation used, e.g., hyper-rectangles, zonotopes, polyhedra. Reachability- 
based methods include [11, 22, 24-27, 34]. Optimization-based methods formulate 
property violations as a threshold for an objective function and use optimization 
algorithms to attempt to satisfy that threshold. Optimization-based methods 
include [2,9,21,29, 33]. Search-based methods explore regions of the input space 
where they then formulate reachability or optimization sub-problems. Search- 
based methods include [6, 10,15, 16, 31,32]. 
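
The reachability-based idea can be illustrated with the simplest symbolic representation
mentioned above, hyper-rectangles. The sketch below propagates lower and upper bounds
through an affine layer and a ReLU layer. The helper names and the toy weights are
assumptions chosen for illustration, and the code does not reproduce any particular
verifier.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate the hyper-rectangle [lower, upper] through y = W x + b."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight maps lower bound to lower bound;
            # a negative weight swaps the roles of the two bounds.
            lo += w * l if w >= 0 else w * u
            hi += w * u if w >= 0 else w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(l, 0.0) for l in lower], [max(u, 0.0) for u in upper]


# Toy network: h = relu(W1 x + b1), y = W2 h + b2, with x in [-1, 1]^2.
W1, b1 = [[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.0]
l, u = affine_bounds([-1.0, -1.0], [1.0, 1.0], W1, b1)
l, u = relu_bounds(l, u)
y_lower, y_upper = affine_bounds(l, u, W2, b2)
```

For this toy network the computed output interval is [0, 4], whereas the true output
relu(x0 + x1) + relu(x0 - x1) never exceeds 2 on the input box. The gap shows the
over-approximation: the two hidden units cannot both reach their upper bound at the same
input, but a hyper-rectangle cannot express that dependency.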


3 DNNV Overview 


DNNV remedies several key challenges faced by the DNN verification commu- 
nity. A general overview of DNNV is shown in Fig. 1. DNNV takes in a property 
and network in a standard format, simplifies the network, reduces the property, 
translates the network and property to the input format of the verifier, runs the 
verifier, and translates its output. Each of these components can be customized
by verifier-specific plugins. We explain these components in more detail below.
3.1 Input Formats


As shown in Table 1, existing verifiers do not support a consistent, common input format
for networks and properties. DNNV standardizes the input and output formats to aid
the community in creating and running verification benchmarks.


Table 2. The number of ONNX operations supported by each verifier.

Verifier | # ONNX operations
DNNV     | 31
ERAN     | 22
nnenum   | 15
Marabou  | 12
VeriNet  | 12

ONNX. For specifying general deep neu- 
ral network architectures, we choose the open 
source DNN format ONNX [19]. ONNX can 
represent real-world networks, is supported by many common frameworks (e.g., 
PyTorch, MXNet) and conversion tools are available for other frameworks (e.g., 
TensorFlow, Keras). Our current implementation supports a subset of the ONNX
specification that subsumes the subsets of ONNX implemented by the supported 
verifiers. Table 2 shows the number of ONNX operations supported by each of 
the verifiers included in DNNV. DNNV supports 40% more operations than the 
verifier with the next highest support. The ONNX subset supported by DNNV 
is sufficient for almost all existing verification benchmarks, as well as many real- 
world networks including VGG16 and ResNet34. 


DNNP. Due to the lack of a standard format for specifying DNN properties, we develop
a Python-embedded DSL for DNN properties, which we call DNNP. DNNP is designed to
express any property that can be verified by existing DNN verifiers in a form that is
independent of the network. DNNP is described in more detail in Appendix A of the extended
version of this paper [23].

We demonstrate DNNP with an example of a local robustness property, shown in Fig. 2.
The property specifies that, for all inputs x_ (Lines 14-23) in the input space (Line 18)
and within a hyper-rectangle of radius e centered at the given input x (Line 19), the
network should predict the same maximum class for both x_ and x (Line 21). For Fashion
MNIST, this means that for all images within an $L_\infty$ distance of e (specified on
Line 12) from image 1 of the dataset (selected on Lines 10-11), the network should classify
all of these images the same as it does for image 1. We first import several Python
packages that will be useful for specifying the property (Lines 1-3), including
the dataset used to train the network, and a method for data manipulation.
Because DNNP allows importing arbitrary Python packages, it enables re-use
of the same data loading and manipulation methods used to train a network.
After importing the necessary utilities, we define several variables that will be
used in the final property expression (Lines 5-12). Two of these variables, i on
Line 10 and e on Line 12, are declared as parameters, which allows them to be
specified on the command line at run time. The value for e must be provided
at run time, since no default value is provided. Finally, we define the semantics
of the property specification, using methods provided by DNNP, as well as
variables defined above (Lines 14-23).

     1  from dnnv.properties import *
     2  from torchvision.datasets import FashionMNIST
     3  from torchvision.transforms import ToTensor
     4
     5  N = Network("N")
     6  data = FashionMNIST("/tmp", download=True,
     7                      transform=ToTensor())
     8  mean = 0.2860
     9  std = 0.3530
    10  i = Parameter("data_idx", type=int, default=1)
    11  x = (data[i][0][None, :].numpy() - mean) / std
    12  e = Parameter("epsilon", type=float) / std
    13
    14  Forall(
    15      x_,
    16      Implies(
    17          And(
    18              (0 - mean) / std <= x_ <= (1 - mean) / std,
    19              x - e <= x_ <= x + e,
    20          ),
    21          argmax(N(x_)) == argmax(N(x)),
    22      ),
    23  )

Fig. 2. Example of a local robustness property specified with DNNP.


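[Figure 3: a convolution with weights $w$ and bias $b$, followed by a batch normalization
operation with scale $\gamma$, shift $\beta$, mean $m$, and variance $v$, is replaced by a
single convolution with weights
$w'_{ijmn} = \frac{\gamma_i}{\sqrt{v_i}} \, w_{ijmn}$
and bias
$b'_i = \frac{\gamma_i}{\sqrt{v_i}} \, b_i + \beta_i - \frac{\gamma_i \, m_i}{\sqrt{v_i}}$.]


Fig. 3. Batch Normalization Simplification simplifies a batch norm following a
convolution operation to an equivalent single convolution operation with modified weights
and bias, while maintaining the strides and pads.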


Other Input Formats. Since the creation of DNNV, two new input formats, 
VNNLIB [13] and SOCRATES [20], have emerged in an attempt to standardize 
the verifier input space. The current draft of VNNLIB also uses ONNX as the
DNN input format; however, it supports a much smaller set of operations than
DNNV, supporting only 17 ONNX operations. The VNNLIB property format
is a subset of SMTLIB in which variables of the form $X_i$ are implicitly mapped
to network inputs and variables of the form $Y_i$ are implicitly mapped to network
outputs. In its current form, this specification only supports DNN models
with a single flat input tensor and single flat output tensor, whereas DNNP and
ONNX can support DNN models with multiple input and output tensors of any
shape. SOCRATES proposes a JSON format containing both the property and 
network specifications. Because DNNV treats networks and properties indepen- 
dently, properties can be re-used for multiple networks, and only a single network 
must be stored to check multiple properties, resulting in a lower storage cost, 
especially for large models. Additionally, while the custom JSON format used 
by SOCRATES requires new DNN translation tools to be written to convert to 
the required format, the ONNX format used by DNNV is commonly available 
in most machine learning frameworks. While we believe that ONNX and DNNP
are the most expressive and easily accessible input formats currently
proposed, DNNV can provide benefits to any format through DNN simplification
and property reduction to increase the applicability of all verifiers.
3.2 Network Simplification


In order to allow verifiers to be applied to a wider range of real-world networks,
DNNV provides tools for network simplification. Network simplification takes in
an operation graph and applies a set of semantics-preserving transformations to
the operation graph to remove unsupported structures, or to transform sequences
of operations into a single more commonly supported operation.

An operation graph $G_N = (V_N, E_N)$ is a directed graph where nodes
$v \in V_N$ represent operations and edges $e \in E_N$ represent inputs to those
operations. Simplification, $\mathit{simplify} : \mathcal{G} \to \mathcal{G}$, transforms
an operation graph $G_N \in \mathcal{G}$ into an equivalent DNN with more commonly
supported structure, $\mathit{simplify}(G_N) = G_{N'}$, such that the resulting DNN has
the same behavior as the original, $\forall x.\ N(x) = N'(x)$.

One such simplification is batch normalization simplification, which removes 
batch normalization operations from a network by combining them with a 
preceding convolution operation or generalized matrix multiplication (GEMM) 
operation. This is possible since batch normalization, convolution, and GEMM 
operations are all affine operations. The simplification of a batch normalization 
operation following a convolution operation is shown in Fig. 3. If no applicable
preceding layer exists, the batch normalization layer is converted into an equiva- 
lent convolution operation. This simplification enables the application of verifiers 
without explicit support for batch normalization operations, such as Neurify and 
Marabou, to networks with these operations. 
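
The same folding applies when the preceding operation is a GEMM. The sketch below is
a minimal pure-Python illustration under two assumptions that are ours rather than the
paper's: the GEMM is written as z = W x + b with one output per batch normalization
channel, and batch normalization uses the usual small constant `eps` inside the square
root. All function names are hypothetical.

```python
import math


def gemm(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def batch_norm(z, gamma, beta, mean, var, eps=1e-5):
    return [g * (zi - m) / math.sqrt(v + eps) + b
            for zi, g, b, m, v in zip(z, gamma, beta, mean, var)]


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (W', b') with gemm(W', b', x) == batch_norm(gemm(W, b, x), ...)."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    new_weights = [[s * w for w in row] for s, row in zip(scale, weights)]
    # b'_i = scale_i * b_i + beta_i - scale_i * m_i
    new_bias = [s * (b - m) + be for s, b, m, be in zip(scale, bias, mean, beta)]
    return new_weights, new_bias


W = [[1.0, -2.0], [0.5, 3.0]]
b = [0.1, -0.2]
gamma, beta = [2.0, 0.5], [0.3, -0.1]
mean, var = [0.4, 1.0], [4.0, 0.25]

W_folded, b_folded = fold_batch_norm(W, b, gamma, beta, mean, var)
x = [1.5, -0.5]
original = batch_norm(gemm(W, b, x), gamma, beta, mean, var)
simplified = gemm(W_folded, b_folded, x)
```

On this input, `original` and `simplified` agree up to floating-point rounding, which is
the equivalence that network simplification must preserve.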


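[Figure 4: the original DNN $N$ is extended with a suffix that maps its outputs to two
classes. Outputs satisfying
$(\mathrm{argmax}(N(x)) \neq 4) \lor (N(x)_{\mathit{pullover}} \geq N(x)_{\mathit{sneaker}})$
are non-violations, and outputs satisfying
$(\mathrm{argmax}(N(x)) = 4) \land (N(x)_{\mathit{pullover}} < N(x)_{\mathit{sneaker}})$
are violations. The original property (bottom left),
$\phi = \forall x.\ ((x \in X) \land (\mathrm{argmax}(N(x)) = 4)) \rightarrow (N(x)_{\mathit{pullover}} \geq N(x)_{\mathit{sneaker}})$,
becomes a property $\phi'$ over $x \in X$ that compares the two outputs of the extended
network.]


Fig. 4. Property reduction to a local robustness property adds a suffix that classifies
outputs as violations or non-violations of the original output constraints, and changes
the property to a common form of robustness property.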


DNNV currently includes 6 additional DNN simplifications, enumerated and 
described in more detail in Appendix B of the extended version of this paper [23]. 


3.3 Property Reduction 


In order to allow verifiers to be applied to more general safety properties, DNNV 
provides tools to reduce properties to a supported form. For instance, properties 
can be translated to local robustness properties, which are required by MIPVerify,
or to reachability properties, which are required by Reluplex.
Property reduction takes in a verification problem, which is comprised of
a property specification and a network, and encodes it as an equivalid set of 
verification problems with properties in a form supported by a given verifier. 

A verification problem is a pair, $\psi = \langle N, \phi \rangle$, of a DNN, $N$, and a
property specification, $\phi$, formed to determine whether $N \models \phi$ is valid.
Reduction, $\mathit{reduce} : \Psi \to \mathcal{P}(\Psi)$, aims to transform a verification
problem, $\langle N, \phi \rangle = \psi \in \Psi$, to an equivalid form,
$\mathit{reduce}(\psi) = \{\langle N_1, \phi_1 \rangle, \ldots, \langle N_k, \phi_k \rangle\}$,
in which property
specifications are in a common supported form. As defined, reduction has two
key properties. The first property is that the set of resulting problems is equivalid 
with the original verification problem. The second property is that the resulting 
set of problems all use the same property type. Applying reduction enables veri- 
fiers to support a large set of verification problems by implementing support for 
a single property type. 

For example, given a network that classifies images of clothing items, a user 
may want to specify that, if the network classifies an image as a coat, then the 
score given to the class of a pullover is not less than the score for the sneaker class. 
The property is specified in the bottom left of Fig. 4. Such a verification problem 
can be difficult to specify for many verifiers. For example, Neurify would require 
writing code to specify linear constraints for the property and re-compiling the 
verifier, and MIPVerify cannot support this property as is. DNNV can reduce 
this verification problem to an equivalent problem with a robustness property. 

A high-level overview of this reduction is shown in Fig. 4; a more detailed
description is provided in Appendix C of the extended version of this paper [23]. 
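
The idea of classifying outputs as violations or non-violations can be sketched in a few
lines of Python. The construction below is a simplified illustration and not necessarily
the one DNNV implements (see Appendix C of the extended version for that). It assumes the
standard Fashion MNIST class indices (pullover 2, coat 4, sneaker 7), treats "classified
as a coat" as a strict argmax, and uses hypothetical helper names. The minimum is written
with a ReLU so that the suffix only uses operations that ReLU-based verifiers support.

```python
def relu(a):
    return max(a, 0.0)


def relu_min(a, b):
    # min(a, b) = a - relu(a - b)
    return a - relu(a - b)


COAT, PULLOVER, SNEAKER = 4, 2, 7  # Fashion MNIST class indices


def suffix(y):
    """Map network outputs y to [non-violation score, violation score].

    The violation score is positive exactly when y is a counterexample: COAT is
    the strict argmax and the PULLOVER score is below the SNEAKER score.
    """
    margin = y[SNEAKER] - y[PULLOVER]
    for j in range(len(y)):
        if j != COAT:
            margin = relu_min(margin, y[COAT] - y[j])
    return [0.0, margin]


def extend(network):
    """Compose a network with the suffix, giving the network of the reduced problem."""
    return lambda x: suffix(network(x))


def satisfies_reduced_property(outputs):
    # Robustness-style post-condition: class 0 (non-violation) is predicted.
    return outputs[0] >= outputs[1]
```

With this suffix, the original property holds on an input region exactly when the extended
network predicts class 0 on the whole region, which is the form of property that
robustness-oriented verifiers accept.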


3.4 Input and Output Translation 


Because of the large variety of input formats required by the verifiers, one of 
the primary components of DNNV translates from its internal representation 
of properties and networks to the input formats of each verifier. 

DNNV also requires an output translator that parses the results of running
a verifier and returns sat, unsat, or unknown. If the result is sat, indicating
that a violation was found, DNNV also returns a counterexample to the property,
and validates that it does violate the property by performing inference with the
network and confirming that the input and output do not satisfy the property.
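
The validation step amounts to re-checking the implication from Sect. 2 on a single
point. A minimal sketch, assuming the property is available as two Python predicates and
the network as a callable (all names here are hypothetical and not DNNV's API):

```python
def validate_counterexample(network, precondition, postcondition, x):
    """Return True if x genuinely violates: precondition(x) -> postcondition(N(x))."""
    if not precondition(x):
        return False  # x lies outside the constrained input region
    return not postcondition(network(x))


# Toy problem: inputs in [0, 1]^2 must produce an output of at most 1.5.
def toy_network(x):
    return [x[0] + x[1]]


def in_unit_box(x):
    return all(0.0 <= xi <= 1.0 for xi in x)


def output_bounded(y):
    return y[0] <= 1.5


genuine = validate_counterexample(toy_network, in_unit_box, output_bounded, [1.0, 1.0])
spurious = validate_counterexample(toy_network, in_unit_box, output_bounded, [0.5, 0.5])
outside = validate_counterexample(toy_network, in_unit_box, output_bounded, [2.0, 2.0])
```

Running this check on every reported counterexample guards against translation errors and
against numerical imprecision in a verifier, since a claimed violation is accepted only if
inference on the original network reproduces it.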


4 Implementation 


DNNV is written in 8400 lines of Python code and is available for download and
re-use at https://doi.org/10.5281/zenodo.4883626. Python was chosen due to its
ubiquitous use for developing deep neural networks. DNNV currently supports 
13 verifiers, and was designed to facilitate the integration of new verifiers. The 
currently supported verifiers are shown in Table 1, along with their original input 
formats, and algorithmic approach. Around 2000 LOC (of the 8400 total LOC) 
are used to integrate these 13 verifiers into DNNV, with Planet requiring the 
most effort at 437 lines, and BaB and BaBSB requiring the least effort with 89 
lines of code due to re-use of the Planet input translator. 
4.1 Supporting Reuse and Extension


DNNV is designed to facilitate the integration of new verifiers. The 5 primary
components of DNNV (DNN simplification, property reduction, input translation,
verifier execution, and output translation) are designed to be re-usable, and
to facilitate the implementation of new components by providing utilities for
traversing and manipulating operation graphs and properties.

Networks are represented as an operation graph, where nodes represent oper- 
ations in the DNN and edges represent inputs and outputs to those operations. 
The operation graph can also be traversed using a visitor pattern. This pat- 
tern is particularly useful for the development of DNN simplifications and input 
translators. It allows developers to easily traverse computation graphs in order 
to translate operations to the required format. We provide built-in utilities for 
converting from our internal network representation to ONNX, PyTorch, and 
TensorFlow models. The implementation also includes utilities for performing 
pattern matching on operation graphs. We utilize this feature to provide utilities 
that transform a network from an operation graph representation to a sequential 
layer representation, which is particularly useful for the network input translator 
of Neurify, which requires DNNs to have a regular structure of a set of convolutional
layers followed by fully connected layers, all with ReLU activations.
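
The following sketch shows how a visitor over an operation graph can drive an input
translator. The class names (`Operation`, `OperationVisitor`, `SequentialTranslator`) and
the textual output format are hypothetical and chosen for illustration; they are not the
classes of the DNNV implementation.

```python
class Operation:
    """A node of the operation graph; `inputs` are the operations feeding it."""

    def __init__(self, *inputs):
        self.inputs = inputs


class Input(Operation):
    pass


class Gemm(Operation):
    pass


class Relu(Operation):
    pass


class Conv(Operation):
    pass


class OperationVisitor:
    """Visit each operation once, inputs first, dispatching on the class name."""

    def __init__(self):
        self._seen = set()

    def visit(self, op):
        if id(op) in self._seen:
            return
        self._seen.add(id(op))
        for parent in op.inputs:
            self.visit(parent)
        handler = getattr(self, "visit_" + type(op).__name__, self.generic_visit)
        handler(op)

    def generic_visit(self, op):
        raise NotImplementedError("unsupported operation: " + type(op).__name__)


class SequentialTranslator(OperationVisitor):
    """Toy translator that emits one line per layer of a fully connected network."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def visit_Input(self, op):
        self.lines.append("input")

    def visit_Gemm(self, op):
        self.lines.append("fully_connected")

    def visit_Relu(self, op):
        self.lines.append("relu")


network = Relu(Gemm(Relu(Gemm(Input()))))
translator = SequentialTranslator()
translator.visit(network)
```

A translator only implements handlers for the operations its verifier supports; any other
operation falls through to `generic_visit`, so an unsupported network is rejected with a
clear error instead of being translated incorrectly.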


4.2 Usage 


DNNV can be run from the command line as follows: python -m dnnv <prop> 
<verifier> --network <name> <path>, where the arguments correspond to a 
DNN model in the ONNX format, a property written in DNNP, and the verifier 
to run. Many additional options can be seen by specifying the -h option. 

After execution, for each verifier, DNNV reports the verification result as 
one of sat (if the property was falsified), unsat (if the property was proven to 
hold), unknown (if the verifier is incomplete and could not prove the property 
holds), or error, along with the reason for error, if an error occurs during DNN 
and property translation, or during verifier execution. DNNV also reports the 
time to translate and verify the property. 


5 Study 


We now examine the applicability of verifiers to existing verification benchmarks 
with and without DNNV. A verification benchmark consists of a set of verifica- 
tion problems which are used to evaluate the performance of a verifier. A problem 
is made of a DNN and a property specification and asks whether the property is 
valid for the given DNN. We consider a verifier to support a benchmark if it can 
be run on that benchmark out of the box. We consider a verifier to have support 
for a benchmark through DNNV if DNNV can be run on that benchmark with 
networks specified using ONNX and properties specified in DNNP, and can 
reduce, simplify, and translate the problem to work with the target verifier. 
Benchmarks. To evaluate benchmark support, we collected the benchmarks
used by each of the 13 verifiers supported by DNNV, and determined whether
each verifier can run on the benchmark out of the box, and also whether they
could be run on the benchmark when DNNV is applied. The verification
benchmarks are shown in Table 3 and are also described in more detail in Appendix D
of the extended version of this paper [23]. Each row of the table corresponds
to a benchmark, to which we assign a short key for identifying the benchmark.
For each benchmark, we give the name, some of the verifiers it evaluated, the
number of properties (#P) and networks (#N), and features that can make it
challenging for verifiers. These features include whether any properties cannot
represent their input constraints using hyper-rectangles (¬HR), whether any
network in the benchmark contains convolution operations (C), whether any
network contains residual structures (R), and whether any network uses any
non-ReLU activation functions (¬ReLU).


Table 3. Verifier benchmarks.

Key | Name              | Uses            | #P    | #N
AX  | ACAS Xu           | [1,6,16,17,30]  | 10    | 45
CD  | Collision Detection | [6,10,17]     | 500   | 1
PM  | Planet MNIST      | [10]            | 7     | 1
TS  | TwinStream        | [5]             | 1     | 81
PCA | PCAMNIST          | [6]             | 12    | 17
MM  | MIPVerify MNIST   | [29]            | 10000 | 5
MC  | MIPVerify CIFAR10 | [29]            | 10000 | 2
NM  | Neurify MNIST     | [14,30]         | 500   | 4
NDs | Neurify Drebin    | [30]            | 500   | 3
NDv | Neurify DAVE      | [30]            | 200   | 1
DZM | DeepZono MNIST    | [25]            | 1700  | 10
DZC | DeepZono CIFAR10  | [25]            | 1700  | 5
DPM | DeepPoly MNIST    | [14,26]         | 1500  | 8
DPC | DeepPoly CIFAR10  | [26]            | 800   | 5
RZM | RefineZono MNIST  | [27]            | 800   | 8
RZC | RefineZono CIFAR10 | [27]           | 200   | 2
RPM | RefinePoly MNIST  | [24]            | 600   | 6
RPC | RefinePoly CIFAR10 | [24]           | 300   | 3
VC  | VeriNet CIFAR10   | [14]            | 250   | 1

[Table 3 feature columns (¬HR, C, R, ¬ReLU): check marks not reproduced.]


Results. The support of verifiers for each benchmark is shown in Table 4. Each
row of this table corresponds to one of the 13 verifiers supported by DNNV, and
each column corresponds to one of the 19 benchmarks identified in Table 3. Each
cell of the table may contain a circle that identifies the support of the verifier for
the benchmark. The left half of the circle is black if the verifier can support the
benchmark out of the box, and is white otherwise. The right half is black if the
verifier supports the benchmark through DNNV, and white otherwise. An absent
circle indicates that the verifier cannot be made to support some aspect of the
benchmark. For the benchmarks shown here, this is always due to the presence
of non-ReLU activation functions in some of the networks in the benchmarks.


Table 4. Benchmark support by each verifier. The left half of the circle is black if
the verifier can support the benchmark out of the box, and is white otherwise. The
right half is black if the verifier supports the benchmark through DNNV, and is white
otherwise. An absent circle indicates that the verifier cannot be made to support some
aspect of the benchmark.

[Table 4: rows are the verifiers Reluplex, Planet, BaB, BaBSB, MIPVerify, Neurify,
DeepZono, DeepPoly, RefineZono, RefinePoly, Marabou, nnenum, and VeriNet; columns are
the 19 benchmark keys of Table 3; circle entries not reproduced.]

As shown in Table 4, DNNV can dramatically increase the support of verifiers
for benchmarks. For example, the Planet verifier could originally be run
on 5 of the 19 benchmarks, but could be run on 16 using DNNV. Similarly, the
nnenum verifier could originally be run on only 1 of the existing benchmarks,
but could be run on 13 using DNNV. Of the 223 pairs of verifiers and
benchmarks for which support may be possible, 166 are currently
supported by DNNV, over 2.4 times the 68 pairs
supported without DNNV.
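
These counts also account for the percentages quoted in the abstract. The short
calculation below only restates numbers given in the text; the variable names are ours.

```python
possible_pairs = 223      # verifier/benchmark pairs for which support may be possible
supported_without = 68    # pairs supported out of the box
supported_with = 166      # pairs supported through DNNV

share_without = supported_without / possible_pairs  # roughly 0.30
share_with = supported_with / possible_pairs        # roughly 0.74
ratio = supported_with / supported_without          # roughly 2.44

# The full grid has 13 verifiers x 19 benchmarks = 247 cells, so 247 - 223 = 24
# cells correspond to pairs that cannot be supported.
excluded_pairs = 13 * 19 - possible_pairs
```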
6 Conclusion


We present the DNNV framework for reducing the burden on DNN verifier 
researchers, developers, and users. DNNV standardizes input and output for- 
mats, includes a simple yet expressive DSL for specifying DNN properties, and 
provides powerful simplification and reduction operations to facilitate the appli- 
cation, development, and comparison of DNN verifiers. Our study showed the 
potential of DNNV and we made its implementation available, with support for 
13 verifiers, and extensive documentation. 
Abstract. Several important models of machine learning algorithms
have been successfully generalized to the quantum world, with potential
speedup to training classical classifiers and applications to data analytics
in quantum physics that can be implemented on the near future
lytics in quantum physics that can be implemented on the near future 
quantum computers. However, quantum noise is a major obstacle to the 
practical implementation of quantum machine learning. In this work, 
we define a formal framework for the robustness verification and anal- 
ysis of quantum machine learning algorithms against noises. A robust 
bound is derived and an algorithm is developed to check whether or not 
a quantum machine learning algorithm is robust with respect to quantum 
training data. In particular, this algorithm can find adversarial examples 
during checking. Our approach is implemented on Google’s TensorFlow 
Quantum and can verify the robustness of quantum machine learning 
algorithms with respect to a small disturbance of noises, derived from 
the surrounding environment. The effectiveness of our robust bound and 
algorithm is confirmed by the experimental results, including quantum 
bits classification as the “Hello World” example, quantum phase recogni- 
tion and cluster excitation detection from real world intractable physical 
problems, and the classification of MNIST from the classical world. 


Keywords: Quantum machine learning · Robustness verification ·
Adversarial examples · Robust bound


1 Introduction 


In the last few years, the successful interplay between machine learning and 
quantum physics shed new light on both fields. On the one hand, machine learn- 
ing has been dramatically developed to satisfy the need of the industry over 
the past two decades. At the same time, many challenging quantum physical
problems have been solved by automated learning. Notably, inaccessible quan- 
tum many-body problems have been solved by neural networks, one instance 
of machine learning [1]. On the other hand, as the new model of computation
under quantum mechanics, quantum computing has been proven to
(exponentially) speed up classical algorithms for some important problems [2].
This motivates the development of quantum machine learning and provides the 
possibility of improving the existing computational power of machine learning 
to a new level (see the review papers [3,4] for the details). After that, quantum 
machine learning was integrated into solving real world problems in quantum 
physics. One essential example is that quantum convolutional neural networks 
inspired by machine learning were proposed to implement quantum phase recog- 
nition [5]. Quantum phase recognition asks whether a given input quantum state 
belongs to a particular quantum phase of matter. At the same time, more prov- 
able advantages of quantum machine learning than the classical counterpart have 
been reported. For instance, the training complexity of quantum models has an 
exponential improvement on certain tasks [6]. Stepping into industries, Google 
recently built up a framework TensorFlow Quantum for the design and train- 
ing of quantum machine learning within its famous classical machine learning 
platform—TensorF low [7]. 

Even though quantum machine learning outperforms its classical counterpart
in some respects, the difficulties of the classical world are expected to
arise in the quantum case as well. Classical machine learning has been found
to be vulnerable to intentionally crafted adversarial examples (e.g. [8,9]).
Adversarial examples are inputs to a machine learning algorithm that an
attacker has crafted to cause the algorithm to make a mistake. One essential
task in machine learning is to prove the absence of adversarial examples or
to detect them. Detected examples are used in a defense strategy called
adversarial training [10]: adversarial examples are appended to
the training dataset and the machine learning algorithm is retrained to be
robust to them. However, this goal is not easily achieved [11]. The machine
learning community has developed several interesting ideas for designing
specific attack algorithms (e.g. [10,12]) that generate adversarial examples,
but this is far from measuring robustness against an arbitrary adversary.
Recently, the formal methods community has taken initial steps in this
direction [13-16] by verifying the robustness of classical machine learning
algorithms in a provable way: either a formal guarantee is given that the
algorithm is robust for a given input, or a counter-example (adversarial
example) is provided if the input is not robust. Some tools have been
developed, such as VerifAI [17] and NNV [18]. Vulnerability is even more
common in the quantum world, since quantum noise is inevitable in quantum
computation, at least in the current NISQ (Noisy Intermediate-Scale Quantum)
era. This has led to a series of recent works on the robustness of
quantum machine learning against specific noise. For example, Lu
et al. [19] studied the robustness to various classical adversarial attacks; Du
et al. [20] proved that by adding depolarization noise to quantum circuits for
classification, a robust bound against adversaries can be derived; Liu and
Wittek [21] gave a robust bound for quantum noise coming from a special unitary
group. Very recently, Weber et al. [22] formalized a link between binary
quantum hypothesis testing [23] and robust quantum machine learning algorithms
for classification tasks.

To the best of our knowledge, existing studies of quantum machine learning
robustness only consider the situation of a known noise source. However, a
fundamental difference between quantum and classical machine learning is that
the quantum attacker is usually the surrounding environment rather than a
human, as in the classical case, and information about the environment is
unknown. To protect against an unknown adversary, we need to derive a
robustness guarantee against a worst-case scenario, from which the commonly
assumed known noise sources (e.g. depolarization noise [20]) are usually far.
Yet in the case of unknown noise, several basic issues are still unsolved:


— In theory, it is unclear how to compute a tight and even the optimal bound 
of the robustness for any given quantum machine learning algorithm. 

— In practice, an efficient way to find an adversarial example, which can be used 
to retrain the algorithm to defend against the noise, is lacking. Indeed, we do not
even know which metric is the better choice for measuring robustness against
noise, just as in the classical case against human attackers [24].


In this work, we define a formal framework for the robustness verification
and analysis of quantum machine learning algorithms against noise, in which
the above problems can be studied in a principled way. More specifically, we
choose fidelity as the metric for measuring robustness, as it is one of the
most widely used quantities for quantifying the uncertainty of noise in
quantum computation and is commonly used in the quantum engineering and
experimental communities (e.g. [25,26]). Based on this, an analytical robust
bound for any quantum machine learning classification algorithm is obtained
and can be applied to approximately check the robustness of quantum machine
learning algorithms. Furthermore, we show that computing the optimal robust
bound can be reduced to solving a Semidefinite Programming (SDP) problem. These
results lead to an algorithm that exactly and efficiently checks whether or
not a quantum machine learning algorithm is robust with respect to the
training data. A special strength of this algorithm is that it can identify
useful new training data (adversarial examples) during checking, and these
data can be used to implement adversarial training in the same way as in
classical robustness verification. The effectiveness of our robust bound and
algorithms is confirmed by case studies of quantum bit classification (the
“Hello World” example of quantum machine learning algorithms), quantum phase
recognition and cluster excitation detection (real-world intractable physical
problems), and the classification of MNIST from the classical world.

In summary, the main technical contributions of the paper are as follows. 


— Computing the optimal robust bound of quantum machine classification algo- 
rithms is reduced to an SDP (Semidefinite Programming) problem; 

— An efficient algorithm to check the robustness of quantum machine learning
algorithms and detect adversarial examples is developed; 

— The implementation of the robustness verification algorithm on Google’s Ten- 
sorFlow Quantum; 

— Case studies: checking the robustness of several popular quantum machine
learning algorithms for quantum bit classification, cluster excitation
detection and the classification of MNIST (all implemented in Google’s
TensorFlow Quantum), and for quantum phase recognition.


2 Quantum Data and Computation Models 


For the convenience of the reader, in this section, we recall some basic concepts 
of quantum data (states) and the quantum computation model. 

The basic data of classical computers are bits, represented by the two digits 0
and 1. In quantum computing, quantum bits (qubits) play the same role. A qubit
is expressed by a normalized complex vector
$|\phi\rangle = \begin{pmatrix} a \\ b \end{pmatrix} = a|0\rangle + b|1\rangle$ with
complex numbers $a$ and $b$ satisfying the normalization condition $|a|^2 + |b|^2 = 1$.
Here, $|0\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $|1\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$
correspond to bits 0 and 1, respectively, and $\{|0\rangle, |1\rangle\}$
is an orthonormal basis of a 2-dimensional Hilbert (linear) space. In general, for
a quantum computer consisting of $n$ qubits, a quantum datum is a normalized
complex vector $|\psi\rangle$ in a $2^n$-dimensional Hilbert space $\mathcal{H}$. Such a $|\psi\rangle$ is usually
called a pure state in the literature of quantum computation.

As a model for computation, a quantum circuit consists of a sequence of, say,
$m$ quantum logic gates. Each quantum gate can be mathematically represented
by a unitary matrix $U_i$ on $\mathcal{H}$, i.e., $U_i^\dagger U_i = U_i U_i^\dagger = I$, where $U_i^\dagger$ is the conjugate
transpose of $U_i$ and $I$ is the identity matrix on $\mathcal{H}$. Then the circuit is represented
by the unitary matrix $U = U_m \cdots U_1$. If the quantum datum $|\psi\rangle$ is input to
the circuit, then the output is the quantum datum

\[ |\psi'\rangle = U|\psi\rangle. \tag{1} \]
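The composition rule and Eq. (1) can be sketched numerically as follows. This is a minimal illustration, assuming NumPy and a one-qubit circuit made of a Hadamard gate followed by a NOT gate; the helper names `circuit_unitary` and `is_unitary` are hypothetical and not part of the paper's implementation.

```python
import numpy as np

# Two single-qubit gates; in circuit order, H is applied first and X second.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)  # Hadamard gate
X = np.array([[0, 1], [1, 0]], dtype=complex)                # NOT gate


def circuit_unitary(gates):
    """Return U = U_m ... U_1 for gates given in application order [U_1, ..., U_m]."""
    U = np.eye(gates[0].shape[0], dtype=complex)
    for gate in gates:
        U = gate @ U  # later gates multiply from the left
    return U


def is_unitary(U, tol=1e-9):
    """Check U^dagger U = I up to numerical tolerance."""
    return np.allclose(U.conj().T @ U, np.eye(U.shape[0]), atol=tol)


ket0 = np.array([1, 0], dtype=complex)  # |0>
U = circuit_unitary([H, X])
psi_out = U @ ket0                      # Eq. (1): |psi'> = U |psi>
```

Note that the gate applied last appears leftmost in the matrix product, which is why the loop multiplies each new gate from the left.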


In practice, a quantum datum may not be completely known and can be
thought of as a mixed state or ensemble $\{(p_k, |\psi_k\rangle)\}_k$, meaning that it is in state $|\psi_k\rangle$
with probability $p_k$. Mathematically, it can be described by a density operator
$\rho$ (a Hermitian positive semidefinite matrix with unit trace) on $\mathcal{H}$:

\[ \rho = \sum_k p_k |\psi_k\rangle\langle\psi_k|, \tag{2} \]

where $\langle\psi_k|$ is the conjugate transpose of $|\psi_k\rangle$, i.e., $|\psi_k\rangle = \langle\psi_k|^\dagger$. Here $\rho$ has unit
trace if $\mathrm{tr}(\rho) = 1$, where the trace $\mathrm{tr}(\rho)$ of $\rho$ is defined as the sum of the
diagonal elements of $\rho$. In this case, the
model of quantum computation becomes a super-operator $\mathcal{E}$, i.e. a mapping
from matrices to matrices. It can be written as

\[ \rho' = \mathcal{E}(\rho). \tag{3} \]

Here, $\rho$ and $\rho'$ are the input and output data (mixed states) of the quantum
computation $\mathcal{E}$, respectively. Not every super-operator $\mathcal{E}$ is physically meaningful. It
is required to satisfy the following conditions:


— $\mathcal{E}$ is trace-preserving: $\mathrm{tr}(\mathcal{E}(\rho)) = \mathrm{tr}(\rho)$ for all mixed states $\rho$ on $\mathcal{H}$;

— $\mathcal{E}$ is completely positive: for any Hilbert space $\mathcal{H}'$, the trivially extended
operator $\mathrm{id}_{\mathcal{H}'} \otimes \mathcal{E}$ maps density operators to density operators on $\mathcal{H}' \otimes \mathcal{H}$,
where $\otimes$ denotes the tensor product and $\mathrm{id}_{\mathcal{H}'}$ is the identity map on $\mathcal{H}'$.


Such a super-operator $\mathcal{E}$ admits a Kraus matrix form [2]: there exists a set of
matrices $\{E_k\}_k$ on $\mathcal{H}$ such that

\[ \mathcal{E}(\rho) = \sum_k E_k \rho E_k^\dagger. \]

Here $\{E_k\}_k$ are called the Kraus matrices of $\mathcal{E}$ [2].
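As a concrete illustration of the Kraus form, the following sketch applies a super-operator to a density matrix and checks trace preservation. It assumes NumPy, and the bit-flip noise with flip probability 0.1 is an example chosen here for illustration, not a noise model taken from the paper; the function names are hypothetical.

```python
import numpy as np


def apply_channel(kraus_ops, rho):
    """Return E(rho) = sum_k E_k rho E_k^dagger."""
    return sum(E @ rho @ E.conj().T for E in kraus_ops)


def is_trace_preserving(kraus_ops, tol=1e-9):
    """A Kraus set is trace-preserving iff sum_k E_k^dagger E_k = I."""
    dim = kraus_ops[0].shape[0]
    total = sum(E.conj().T @ E for E in kraus_ops)
    return np.allclose(total, np.eye(dim), atol=tol)


# Illustrative bit-flip noise with flip probability p (an assumed example).
p = 0.1
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
bit_flip = [np.sqrt(1 - p) * I2, np.sqrt(p) * X]

rho = np.array([[1, 0], [0, 0]], dtype=complex)  # the pure state |0><0|
rho_out = apply_channel(bit_flip, rho)           # a mixture of |0><0| and |1><1|
```

The unitary case is recovered by a Kraus set with a single element $U$.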

The underlying dynamics of quantum computers is governed by quantum
mechanics, which applies at the microscopic scale (near or below $10^{-10}$
meters). At this level, we cannot directly read out quantum data in the same
way as classical data. The only way to extract information from a quantum system is
through a quantum measurement, which is mathematically modeled by a set
$\{M_k\}$ of matrices on its state (Hilbert) space $\mathcal{H}$ with $\sum_k M_k^\dagger M_k = I$. This
observing process is probabilistic: if the system is currently in state $\rho$, then
measurement outcome $k$ is obtained with probability

\[ p_k = \mathrm{tr}(M_k^\dagger M_k \rho). \tag{4} \]

After the measurement, the system’s state collapses (changes) depending
on the measurement outcome $k$, which is fundamentally different from classical
computation. If the outcome is $k$, the post-measurement state becomes

\[ \rho_k = \frac{M_k \rho M_k^\dagger}{\mathrm{tr}(M_k^\dagger M_k \rho)}. \tag{5} \]

This special property makes it hard to accurately estimate the distribution $\{p_k\}_k$
unless sufficiently many copies of $\rho$ are provided.
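The measurement rule can be made concrete with a short sketch. It assumes NumPy and a computational-basis measurement on one qubit; the function name `measure` is hypothetical. It returns, for each outcome, the probability of Eq. (4) together with the corresponding post-measurement state.

```python
import numpy as np


def measure(meas_ops, rho):
    """Return [(p_k, rho_k)] for every outcome k.

    p_k is the outcome probability of Eq. (4) and rho_k the normalized
    post-measurement state; rho_k is None when outcome k has probability zero.
    """
    results = []
    for M in meas_ops:
        unnormalized = M @ rho @ M.conj().T
        p_k = float(np.real(np.trace(unnormalized)))  # equals tr(M^dagger M rho)
        rho_k = unnormalized / p_k if p_k > 1e-12 else None
        results.append((p_k, rho_k))
    return results


# Computational-basis measurement on one qubit: M_0 = |0><0|, M_1 = |1><1|.
M0 = np.array([[1, 0], [0, 0]], dtype=complex)
M1 = np.array([[0, 0], [0, 1]], dtype=complex)

# The pure state |+> = (|0> + |1>)/sqrt(2), written as a density matrix.
plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
rho = np.outer(plus, plus.conj())

outcomes = measure([M0, M1], rho)
```

A single run of a real device yields only one outcome $k$, not the probabilities themselves, which is why many copies of $\rho$ are needed to estimate the distribution.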

In summary, quantum data have two different forms, the pure state $|\psi\rangle$ and the
mixed state $\rho$, corresponding to the computation model being a unitary matrix $U$ or
a super-operator $\mathcal{E}$, respectively. Not surprisingly, the latter is a generalization
of the former, obtained by putting

\[ \rho = |\psi\rangle\langle\psi|, \qquad \mathcal{E}(\rho) = U \rho U^\dagger. \]

Because of this, the results obtained for mixed states $\rho$ can also be applied to
pure states $|\psi\rangle$. Thus, in this paper, we mainly consider mixed states as the
quantum data and super-operators as the model of quantum computation.


3 Quantum Classification Algorithms


In this section, we briefly recall quantum classification algorithms. They are 
designed for classification of quantum data. Essentially, they share the same 
basic ideas with their classical counterparts but deal with quantum data in the 
quantum computation model. 


3.1 Basic Definitions 


In this paper, we focus on a specific learning model called quantum supervised 
classification. Given a Hilbert space H, we write D(H) for the set of all (mixed) 
quantum states on H (see its definition in Eq. (2)). 


Definition 1. A quantum classification algorithm $A$ is a mapping $\mathcal{D}(\mathcal{H}) \to \mathcal{C}$,
where $\mathcal{C}$ is the set of classes we are interested in.


Following the training strategy of classical machine learning, the classification
algorithm $A$ is learned from a dataset $T$ instead of being pre-defined. This training
dataset $T = \{(\rho_i, c_i)\}_{i=1}^{N}$ consists of $N < \infty$ pairs $(\rho_i, c_i)$, meaning that quantum
state $\rho_i$ belongs to class $c_i$. To learn $A$, we initialize a quantum learning
model, namely a parameterized quantum circuit (including measurement control) $\mathcal{E}_\theta$
and a measurement $\{M_k\}_{k \in \mathcal{C}}$. Mathematically, the circuit can be modelled as
a quantum super-operator $\mathcal{E}_\theta$ (see its definition in Eq. (3)), and $\theta$ is a set of
free parameters that can be tuned. Then for each $k \in \mathcal{C}$, we can compute the
probability of the measurement outcome being $k$:

\[ f_k(\theta, \rho) = \mathrm{tr}(M_k^\dagger M_k \mathcal{E}_\theta(\rho)). \tag{6} \]


It is worth noting that, as mentioned before, measuring a quantum state $\rho$ is
probabilistic and $\rho$ is changed by the measurement. So, in practice, accurately
estimating $f_k(\theta, \rho)$ for all $k \in \mathcal{C}$ requires sufficiently many copies of $\rho$. This
differs from the classical case, where a single copy of the classical data often
suffices for the training process.

The quantum classification algorithm $A$ outputs the class label $c$ for a quantum
state $\rho$ using the following condition:

\[ A(\theta, \rho) = \arg\max_{k} \mathrm{tr}(M_k^\dagger M_k \mathcal{E}_\theta(\rho)). \tag{7} \]


Learning is carried out by optimizing $\theta$ to minimize the empirical risk

\[ \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f(\theta, \rho_i), c_i), \tag{8} \]

where $\mathcal{L}$ refers to a predefined loss function, $f(\theta, \rho)$ is a probability vector with
each $f_k(\theta, \rho)$, $k \in \mathcal{C}$, as its element, and $c_i$ is also seen as a probability vector with
the entry corresponding to $c_i$ being 1 and the others being 0. The goal is to find the
optimized parameters $\theta^*$ minimizing the risk in Eq. (8) for the given dataset $T$.

Mean-squared error (MSE) is the most popular instance of the empirical risk,
i.e., the loss function $\mathcal{L}$ is the squared error:

\[ \mathcal{L}(f(\theta, \rho_i), c_i) = \frac{1}{C} \| f(\theta, \rho_i) - c_i \|_2^2, \]

where $C$ is the number of classes in $\mathcal{C}$, and $\|\cdot\|_2$ is the $l_2$-norm.
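Equations (6) to (8) with the squared-error loss can be sketched as follows. The sketch assumes NumPy, represents the trained circuit as a plain Python function on density matrices, and uses a toy two-class setup invented here for illustration; none of the names below come from the paper or from TensorFlow Quantum.

```python
import numpy as np


def outcome_probabilities(meas_ops, channel, rho):
    """f_k = tr(M_k^dagger M_k E(rho)) for every class k, as in Eq. (6)."""
    out = channel(rho)
    return np.array([np.real(np.trace(M.conj().T @ M @ out)) for M in meas_ops])


def classify(meas_ops, channel, rho):
    """Eq. (7): return the index of the most likely measurement outcome."""
    return int(np.argmax(outcome_probabilities(meas_ops, channel, rho)))


def mse_risk(meas_ops, channel, dataset):
    """Empirical risk of Eq. (8) with the loss (1/C) * ||f - c||_2^2."""
    num_classes = len(meas_ops)
    total = 0.0
    for rho, label in dataset:
        f = outcome_probabilities(meas_ops, channel, rho)
        one_hot = np.eye(num_classes)[label]  # label as a probability vector
        total += np.sum((f - one_hot) ** 2) / num_classes
    return total / len(dataset)


# Toy setup (assumed for illustration): the "trained" circuit is the identity
# channel and the measurement is in the computational basis.
M0 = np.diag([1.0, 0.0]).astype(complex)
M1 = np.diag([0.0, 1.0]).astype(complex)
identity_channel = lambda rho: rho

rho0 = np.diag([0.9, 0.1]).astype(complex)  # mostly |0>, labelled class 0
rho1 = np.diag([0.2, 0.8]).astype(complex)  # mostly |1>, labelled class 1
dataset = [(rho0, 0), (rho1, 1)]

risk = mse_risk([M0, M1], identity_channel, dataset)
```

In an actual training loop the channel would depend on the parameters $\theta$, and an optimizer would adjust them to lower this risk.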

As one can see in the above learning process, the main differences between 
classical and quantum machine learning algorithms are the learning models and 
data. 

In this paper, we focus on well-trained quantum classification algorithms
$A$, usually called quantum classifiers. Here, $A$ is said to be well-trained if its
training and validation accuracy are both high (>95%). The training (validation)
accuracy is the frequency with which $A$ successfully classifies the data in a training
(validation) dataset. A validation dataset is mathematically equivalent to a training
dataset but is used only for testing $A$ rather than learning $A$. In this context, $\theta^*$ is
naturally omitted, i.e., $A(\rho) = A(\theta^*, \rho)$ and $\mathcal{E}(\rho) = \mathcal{E}_{\theta^*}(\rho)$. Briefly, $A$ consists only
of a super-operator $\mathcal{E}$ and a measurement $\{M_k\}_k$, denoted by $A = (\mathcal{E}, \{M_k\}_k)$.


3.2 An Illustrative Example 


Let us further illustrate the above definitions by a concrete example—Quantum 
Convolutional Neural Networks (QCNNs) [5], one of the most popular and suc- 
cessful quantum learning models. QCNN extends the main features and struc- 
tures of the Convolutional Neural Networks (CNNs) to quantum computing. 

Fig. 1. Simple example of CNN and QCNN. QCNN, like CNN, consists of a convolution
layer that finds a new state and a pooling layer that reduces the size of the model.
Here, MCUG stands for measurement control unitary gate, i.e., unitary matrix $V_j$ is
applied on the circuit if and only if the measurement outcome is 1.

The model of QCNN applies the convolution layer and the pooling layer from
CNNs to quantum systems, as shown in Fig. 1(b). The layout proceeds as follows:


1. The convolution layer (circuit) applies multiple qubit gates $U_i$ between adjacent
qubits to find a new state;

2. The pooling layer reduces the size of the quantum system by measuring a
fraction of qubits, and the outcomes determine the unitary $V_j$ applied to nearby
qubits;

3. Repeat the convolution layer and pooling layer defined in 1-2;

4. When the size of the system is sufficiently small, the fully connected layer is
applied as a unitary matrix $F$ on the remaining qubits.


The input of a QCNN is an unknown quantum state $\rho_{in}$, and the output is
obtained by measuring a fixed number of output qubits. As in the classical case,
the learning model (defined by the number of convolution and pooling layers) is
fixed, but the quantum gates involved (i.e. the unitary matrices $U_i$, $V_j$, $F$) themselves
are learned by the above learning process.


Remark 1. Quantum machine learning can also be used to do classical machine 
learning tasks. Image classification, for example, is one of the most success- 
ful applications of Neural Networks (NNs). To explore the possible advantage of 
quantum computing, Quantum Neural Networks (QNNs) have been used to classify
images in [27,28]. It is shown that by encoding images into a quantum state
$\rho_{in}$, QNNs can achieve high accuracy in image classification. We will present a
quantum classifier for the classification of MNIST as an example in the evaluation
section.


4 Robustness 


An important issue in classical machine learning is how robust a classification
algorithm is to adversarial perturbations. A similar issue exists for quantum
classifiers against quantum noise. Intuitively, the robustness of a quantum classifier
$A$ is its ability to classify correctly under a small perturbation of the
input states. A quantum state $\sigma$ is then considered an adversarial example if
it is similar to a benign state $\rho$, where $\rho$ is correctly classified but $\sigma$ is classified
into a class different from that of $\rho$. Formally,


Definition 2 (Adversarial Example). Suppose we are given a quantum classifier
$A(\cdot)$, an input example $(\rho, c)$, a distance metric $D(\cdot,\cdot)$ and a small enough
threshold value $\varepsilon > 0$. Then $\sigma$ is said to be an $\varepsilon$-adversarial example of $\rho$ if the
following is true:

\[ (A(\rho) = c) \wedge (A(\sigma) \neq c) \wedge (D(\rho, \sigma) \le \varepsilon). \]


The leftmost condition $A(\rho) = c$ asserts that $\rho$ is correctly classified, the middle
condition $A(\sigma) \neq c$ means that $\sigma$ is incorrectly classified, and the rightmost
condition $D(\rho, \sigma) \le \varepsilon$ indicates that $\rho$ and $\sigma$ are similar (i.e., their distance is
small). When $\varepsilon$ is fixed in advance and no ambiguity arises, $\sigma$ is simply called an
adversarial example of $\rho$. Notably, by the above definition, if $A$ incorrectly classifies $\rho$, then
we do not need to consider the corresponding adversarial examples. This is a
correctness issue of the quantum classifier $A$ rather than a robustness issue. Hence,
in the following discussions, we only consider the set of all correctly recognized
states.

The absence of adversarial examples leads to robustness. 


Definition 3 (Adversarial Robustness). Let $A$ be a quantum classifier.
Then $\rho$ is $\varepsilon$-robust for $A$ if there is no $\varepsilon$-adversarial example of $\rho$.


The major problem concerning us in this paper is the following: 


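Problem 1 (Robustness Verification Problem). Given a quantum classifier $A(\cdot)$
and an input example $(\rho, c)$, check whether or not $A(\sigma) = c$ for all $\sigma \in N_\varepsilon(\rho)$,
where $N_\varepsilon(\rho)$ is the $\varepsilon$-neighbourhood of $\rho$, defined as

\[ N_\varepsilon(\rho) = \{\sigma \in \mathcal{D}(\mathcal{H}) : D(\rho, \sigma) \le \varepsilon\}. \]

If not, then an adversarial example (counter-example) $\sigma \in N_\varepsilon(\rho)$ is provided.


Obviously, if $\delta$ is a robust bound for an input example $(\rho, c)$ such that $A(\sigma) = c$
for any state $\sigma \in N_\delta(\rho)$, then for any $\varepsilon < \delta$ (i.e. $N_\varepsilon(\rho) \subseteq N_\delta(\rho)$), there is no
$\varepsilon$-adversarial example of $\rho$. It is a challenging problem to compute the optimal
robust bound $\delta^* = \max \delta$ so that there is no $\varepsilon$-adversarial example if and only
if $\varepsilon < \delta^*$.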

The above adversarial robustness of quantum states can be generalized to a 
notion of robustness for quantum classifiers: 


Definition 4 (Robust Accuracy). Let $A$ be a quantum classifier. The $\varepsilon$-robust
accuracy of $A$ is the proportion of $\varepsilon$-robust states in the training dataset.


Remark 2. Here, the robust accuracy is defined with respect to the training 
dataset. In some applications, the dataset can be chosen as another set of quan- 
tum states with correct classifications, such as a validation dataset or a combi- 
nation of it with the training dataset. 


The reader should notice that the above definitions of robustness for quantum
classifiers are similar to those for classical classifiers. But an intrinsic difference
between them comes from the choice of distance $D(\cdot,\cdot)$. In the classical case,
humans play the role of the adversary, so the distance should guarantee
that a small perturbation is imperceptible to humans, and vice versa. Otherwise,
we cannot take advantage of machine learning over human discrimination ability.
For instance, in image recognition, the distance should reflect perceptual
similarity in the sense that humans would consider adversarial examples generated
under it perceptually similar to the benign image [24]. In the quantum case, it is
essential to choose a distance $D$ that is meaningful in quantum physics. In this
paper, we choose to use the distance

\[ D(\rho, \sigma) = 1 - F(\rho, \sigma) \]

defined by the fidelity

\[ F(\rho, \sigma) = \left[\mathrm{tr}\left(\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}\right)\right]^2. \]

Here $\sqrt{\rho} = \sum_k \sqrt{\lambda_k} |\psi_k\rangle\langle\psi_k|$ if $\rho$ admits the spectral decomposition $\sum_k \lambda_k |\psi_k\rangle\langle\psi_k|$.
Fidelity is one of the most widely used quantities for quantifying the
uncertainty of noise in the experimental quantum physics and quantum engineering
communities (see e.g. [29,30]).
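The distance can be evaluated numerically from the spectral decomposition, as in the following sketch. It assumes NumPy and the squared form of the fidelity given above; the helper names `psd_sqrt`, `fidelity` and `distance` are hypothetical.

```python
import numpy as np


def psd_sqrt(rho):
    """Square root of a positive semidefinite matrix via its spectral decomposition."""
    eigvals, eigvecs = np.linalg.eigh(rho)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.conj().T


def fidelity(rho, sigma):
    """F(rho, sigma) = [tr sqrt(sqrt(rho) sigma sqrt(rho))]^2."""
    root = psd_sqrt(rho)
    inner = root @ sigma @ root
    # tr sqrt(inner) is the sum of the square roots of the eigenvalues of inner.
    eigvals = np.clip(np.linalg.eigvalsh(inner), 0.0, None)
    return float(np.sum(np.sqrt(eigvals)) ** 2)


def distance(rho, sigma):
    """D(rho, sigma) = 1 - F(rho, sigma)."""
    return 1.0 - fidelity(rho, sigma)


ket0 = np.array([1, 0], dtype=complex)
plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
rho = np.outer(ket0, ket0.conj())    # |0><0|
sigma = np.outer(plus, plus.conj())  # |+><+|
d = distance(rho, sigma)
```

For two pure states the fidelity reduces to the squared overlap of the state vectors, so the example pair has distance one half.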


Remark 3. The trace distance has been used in recent literature (e.g. [20]) for
some issues related to quantum robustness verification:

\[ T(\rho, \sigma) = \frac{1}{2} \|\rho - \sigma\|_{\mathrm{tr}} = \frac{1}{2} \mathrm{tr}\left[\sqrt{(\rho - \sigma)^\dagger (\rho - \sigma)}\right]. \]

It is a generalization of the total variation distance, which is a distance measure
for probability distributions. So far, to the best of our knowledge, there is no
discussion in the literature about which distance is better. Here, we argue that
fidelity is better than trace distance in the context of quantum machine learning
against quantum noise. As we know, state distinguishability is the basis for
measuring the effect of noise on quantum computation. The main difference between
trace distance $T(\rho, \sigma)$ and fidelity $F(\rho, \sigma)$ is the number of copies of the states $\rho$
and $\sigma$ required as a resource in experiments for distinguishing them. More
precisely, trace distance quantifies the maximum probability of correctly guessing
through a measurement whether $\rho$ or $\sigma$ was prepared, while fidelity
characterizes the same quantity when infinitely many copies of $\rho$ and $\sigma$ can be supplied
(see Appendix A of the extended version of this paper [31] for more details). In
quantum machine learning, a large enough number of copies of the states is a
precondition for the statistics in Eq. (6) used in learning and classification. Thus, fidelity
is more suitable than trace distance for our purpose.


5 Robust Bound 


In this section, we develop a theoretic basis for robustness verification of quantum 
classifiers. After setting the distance D to be the one defined by fidelity, a robust 
bound can be derived. 


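Lemma 1 (Robust Bound). Given a quantum classifier $A = (\mathcal{E}, \{M_k\}_{k \in \mathcal{C}})$
and a quantum state $\rho$, let $p_1$ and $p_2$ be the first and second largest elements of
$\{\mathrm{tr}(M_k^\dagger M_k \mathcal{E}(\rho))\}_k$, respectively. If $\sqrt{p_1} - \sqrt{p_2} > \sqrt{2\varepsilon}$, then $\rho$ is $\varepsilon$-robust.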


Proof. See Appendix B of the extended version of this paper [31]. 
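The condition of Lemma 1 only needs the vector of outcome probabilities, so it can be checked in a few lines. The sketch below uses only the Python standard library; the function names and the two example distributions are hypothetical. Solving the inequality for $\varepsilon$ also gives the largest threshold the lemma can certify, $(\sqrt{p_1} - \sqrt{p_2})^2 / 2$.

```python
import math


def lemma1_certifies_robust(probs, eps):
    """Return True if sqrt(p1) - sqrt(p2) > sqrt(2 * eps) for the two largest entries.

    True means the state is certified eps-robust by Lemma 1; False is
    inconclusive, because the condition is sufficient but not necessary.
    """
    p1, p2 = sorted(probs, reverse=True)[:2]
    return math.sqrt(p1) - math.sqrt(p2) > math.sqrt(2 * eps)


def lemma1_certified_radius(probs):
    """Every eps strictly below (sqrt(p1) - sqrt(p2))^2 / 2 is certified by Lemma 1."""
    p1, p2 = sorted(probs, reverse=True)[:2]
    return (math.sqrt(p1) - math.sqrt(p2)) ** 2 / 2


confident = [0.81, 0.09, 0.10]  # one outcome clearly dominates
borderline = [0.51, 0.49]       # nearly tied outcomes

certified = lemma1_certifies_robust(confident, 0.01)     # dominated case
undecided = lemma1_certifies_robust(borderline, 0.01)    # needs the exact check
```

A state for which the check fails is not necessarily non-robust; it is only a candidate that requires exact verification.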


The above robust bound gives a quick robustness check based only on the
measurement outcomes of $\rho$, without searching for possible adversarial examples.
Furthermore, it can also be used to compute an under-approximation of the
robust accuracy of $A$ by checking the robustness of the quantum states
in the training dataset one by one. We will see in the later experiments that the
robust bound and the induced robust
accuracy scale well. However, $\sqrt{p_1} - \sqrt{p_2} > \sqrt{2\varepsilon}$ is not
a necessary condition for $\varepsilon$-robustness. Fortunately, when $\sqrt{p_1} - \sqrt{p_2} \le \sqrt{2\varepsilon}$, we
can compute the optimal robust bound by Semidefinite Programming (SDP).
Recall that SDP is a form of convex optimization concerned with optimizing a
linear objective function over the intersection of the cone of positive semidefinite
matrices with an affine space. It has the form

\[
\begin{aligned}
\min \quad & \mathrm{tr}(CX) \\
\text{subject to} \quad & \mathrm{tr}(A_k X) \le b_k, \quad k = 1, \ldots, m, \\
& X \ge 0,
\end{aligned}
\]

where $C, A_1, \ldots, A_m$ are all Hermitian $n \times n$ matrices (i.e. $A^\dagger = A$), and $X$ is the
$n \times n$ matrix optimization variable with $X \ge 0$, i.e., $X$ is positive semidefinite.
Many efficient solvers have been developed for SDPs; they not only compute
the minimal value but also output a corresponding optimal solution $X$. The
following two theorems show that checking $\varepsilon$-robustness and computing the optimal
robust bound of quantum states can both be reduced to an SDP.


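Theorem 1 ($\varepsilon$-robustness Verification). Let $A = (\mathcal{E}, \{M_k\}_{k \in \mathcal{C}})$ be a quantum
classifier and $\rho$ be a state with $A(\rho) = l$. Then $\rho$ is $\varepsilon$-robust if and only if for
all $k \in \mathcal{C}$ with $k \neq l$, the following problem has no solution (feasibility problem):

\[
\begin{aligned}
\min_{\sigma \in \mathcal{D}(\mathcal{H})} \quad & 0 \\
\text{subject to} \quad & \sigma \ge 0, \\
& \mathrm{tr}(\sigma) = 1, \\
& \mathrm{tr}[(M_l^\dagger M_l - M_k^\dagger M_k)\mathcal{E}(\sigma)] \le 0, \\
& 1 - F(\rho, \sigma) \le \varepsilon.
\end{aligned}
\]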


Proof. See Appendix C of the extended version of this paper [31]. 


Actually, the objective function 0 in the above theorem can be chosen as any 
constant number. 
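The fidelity constraint is the only one above that is not visibly linear in $\sigma$. One standard way to see that it still fits the SDP format, stated here as background and not necessarily the route taken in Appendix C of [31], is the known semidefinite characterization of the root fidelity, $\mathrm{tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}} = \max\{\mathrm{Re}\,\mathrm{tr}(X) : \begin{pmatrix} \rho & X \\ X^\dagger & \sigma \end{pmatrix} \ge 0\}$. With the squared fidelity used in this paper and $0 \le \varepsilon \le 1$, the constraint can then be rewritten with an auxiliary matrix variable $X$:

```latex
\[
1 - F(\rho, \sigma) \le \varepsilon
\iff
\exists X :\;
\begin{pmatrix} \rho & X \\ X^\dagger & \sigma \end{pmatrix} \ge 0
\ \text{ and } \
\mathrm{Re}\,\mathrm{tr}(X) \ge \sqrt{1 - \varepsilon}.
\]
```

Both conditions on the right are linear (matrix) inequalities in the pair $(\sigma, X)$, and $\mathcal{E}(\sigma)$ is linear in $\sigma$, so every constraint of the feasibility problem is of SDP type.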


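Theorem 2 (Optimal Robust Bound). Let $A$ and $\rho$ be as in Theorem 1
with $A(\rho) = l$, and let $\delta_k$ be the solution of the following problem:

\[
\begin{aligned}
\delta_k = \min_{\sigma \in \mathcal{D}(\mathcal{H})} \quad & 1 - F(\rho, \sigma) \\
\text{subject to} \quad & \sigma \ge 0, \\
& \mathrm{tr}(\sigma) = 1, \\
& \mathrm{tr}[(M_l^\dagger M_l - M_k^\dagger M_k)\mathcal{E}(\sigma)] \le 0,
\end{aligned}
\]

where $\delta_k = +\infty$ if the problem has no solution. Then $\delta = \min_{k \neq l} \delta_k$ is the
optimal robust bound of $\rho$.


Proof. The proof is similar to that of Theorem 1.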


Remark 4. One may wonder why checking $\varepsilon$-robustness and computing the optimal
robust bound can always be reduced to an SDP. This is implied by the
basic quantum mechanical postulate of linearity; more specifically, all of the
super-operators and measurements used in quantum machine learning algorithms are
linear. In contrast, the functions represented by neural networks in classical
machine learning may be nonlinear, as the pooling layer is not linear. As a
result, the reduced optimization problem for robustness verification is not
convex (e.g. [32]). To overcome this difficulty, many different methods have
been developed to encode the nonlinear activation functions as linear constraints.
Examples include NSVerify [33], MIPVerify [34], ILP [35] and ImageStar [13].


Algorithm 1. StateRobustnessVerifier($A$, $\varepsilon$, $\rho$, $l$)

Require: $A = (\mathcal{E}, \{M_k\}_{k \in \mathcal{C}})$ is a well-trained quantum classifier, $\varepsilon < 1$ is a real
    number, $(\rho, l)$ is an element of the training dataset of $A$
Ensure: true indicates that $\rho$ is $\varepsilon$-robust; false, together with an adversarial example $\sigma$,
    indicates that $\rho$ is not $\varepsilon$-robust
1: for each $k \in \mathcal{C}$ with $k \neq l$ do
2:     By an SDP solver, compute $\delta_k$ with an optimal state $\sigma_k$ in the SDP of Theorem 2
3: end for
4: Let $\delta = \min_k \delta_k$ and $k^* = \arg\min_k \delta_k$
5: if $\delta > \varepsilon$ then
6:     return true
7: else
8:     return false and $\sigma_{k^*}$
9: end if


6 Robustness Verification Algorithms 


In this section, we develop several algorithms for verifying the robustness of 
quantum classifiers based on the theoretic results presented in the last section. 

First, let us consider the robustness of a given quantum state $\rho$. In many
applications (as shown in our experiments in Sect. 7), we are required to check
whether $\rho$ is $\varepsilon$-robust for an arbitrarily given threshold $\varepsilon$. Note that once we
have computed the optimal robust bound $\delta$, checking $\varepsilon$-robustness of $\rho$ is equivalent
to comparing $\varepsilon$ and $\delta$; that is, $\varepsilon < \delta$ if and only if $\rho$ is $\varepsilon$-robust. Combining
this simple observation with Theorem 1, we obtain Algorithm 1 for checking the
$\varepsilon$-robustness of $\rho$ and finding the minimum adversarial perturbation $\delta$ caused by
quantum noise. The main cost of Algorithm 1 is incurred in solving the SDPs in Line 2,
each of which scales as $O(n^{6.5})$ by interior-point methods [36], where $n$ is the number of
rows of the semidefinite matrix $\rho$ in the SDP, i.e., the dimension of the Hilbert space of
the quantum states in our case. As we need to apply an SDP solver $|\mathcal{C}| - 1$
times in Line 1, the total complexity is as follows.


Theorem 3. The worst case complexity of Algorithm 1 is $O(|\mathcal{C}| \cdot n^{6.5})$, where $n$
is the dimension of the input state $\rho$ and $|\mathcal{C}|$ is the number of classes in the set $\mathcal{C}$ we
are interested in.


Now we turn to the robustness of a quantum classifier $A$. Algorithm 2
is designed for checking the robustness of $A$ by combining Algorithm 1
with Lemma 1 (see the discussion in the paragraph after Lemma 1). A major
benefit of formal robustness verification for classical classifiers is perhaps that
it can be used to detect a counter-example (adversarial example) for a given
input (see e.g. [13-16]). This benefit is kept in Algorithm 2 for the robustness
verification of quantum classifiers. In particular, we are able to extend the
technique of adversarial training in classical machine learning [10] to the quantum
case: an adversarial example $\sigma$ is automatically generated once $\varepsilon$-robustness of
$\rho$ fails, and then by appending $(\sigma, l)$ to the training dataset, we can retrain $A$
to improve the robustness of the classifier.


Algorithm 2. RobustnessVerifier($A$, $\varepsilon$, $T$)

Require: $A = (\mathcal{E}, \{M_k\}_{k \in \mathcal{C}})$ is a well-trained quantum classifier, $\varepsilon < 1$ is a real
    number, $T = \{(\rho_i, l_i)\}$ is the training dataset of $A$
Ensure: The robust accuracy $RA$ and a set $R = \{(\sigma_j, i_j)\}$, where for each $j$, $\sigma_j$
    is an $\varepsilon$-adversarial example of $\rho_{i_j}$; $R$ can be an empty set if all states in $T$ are
    $\varepsilon$-robust.
1: Let $R = \emptyset$ be an empty set. // Recording adversarial examples and corresponding
    indexes of states in training dataset $T$
2: for each $(\rho_i, l_i) \in T$ do
3:     Let $p_1$ and $p_2$ be the first and second largest elements of $\{\mathrm{tr}(M_k^\dagger M_k \mathcal{E}(\rho_i))\}_k$,
       respectively.
4:     if $\sqrt{p_1} - \sqrt{p_2} \le \sqrt{2\varepsilon}$ then // Applying the robust bound in Lemma 1
5:         if StateRobustnessVerifier($A$, $\varepsilon$, $\rho_i$, $l_i$) == false then
6:             Let $\sigma$ be the output state of StateRobustnessVerifier($A$, $\varepsilon$, $\rho_i$, $l_i$)
7:             $R = R \cup \{(\sigma, i)\}$
8:         end if
9:     end if
10: end for
11: return $RA = 1 - \frac{|R|}{|T|}$, $R$ // $|R| = 0$ if $R$ is an empty set
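The control flow of Algorithm 2 can be sketched in plain Python as follows. This is only a structural illustration under stated assumptions: the exact check of Algorithm 1 is passed in as a caller-supplied function, and both `outcome_probs` and `state_verifier`, as well as the toy data and the stub verifier, are hypothetical stand-ins and not the paper's TensorFlow Quantum implementation.

```python
import math


def robustness_verifier(outcome_probs, state_verifier, eps, dataset):
    """Sketch of Algorithm 2.

    outcome_probs(rho)         -> list of class probabilities tr(M_k^dagger M_k E(rho))
    state_verifier(rho, label) -> (True, None) if rho is eps-robust, otherwise
                                  (False, sigma) with an adversarial example sigma
    Both callables are hypothetical stand-ins supplied by the caller.
    """
    adversarial = []  # pairs (sigma, index of the attacked training state)
    for i, (rho, label) in enumerate(dataset):
        p1, p2 = sorted(outcome_probs(rho), reverse=True)[:2]
        if math.sqrt(p1) - math.sqrt(p2) > math.sqrt(2 * eps):
            continue  # certified robust by Lemma 1, no exact check needed
        robust, sigma = state_verifier(rho, label)
        if not robust:
            adversarial.append((sigma, i))
    robust_accuracy = 1 - len(adversarial) / len(dataset)
    return robust_accuracy, adversarial


# Toy run: each "state" is represented directly by its outcome distribution,
# and the stub verifier declares a state non-robust when its two largest
# probabilities differ by less than 0.05 (an arbitrary rule for illustration).
calls = []


def stub_verifier(rho, label):
    calls.append(rho)
    p1, p2 = sorted(rho, reverse=True)[:2]
    return (True, None) if p1 - p2 >= 0.05 else (False, "sigma-placeholder")


toy_dataset = [([0.95, 0.05], 0), ([0.55, 0.45], 0), ([0.49, 0.51], 1)]
ra, found = robustness_verifier(lambda rho: rho, stub_verifier, 0.01, toy_dataset)
```

In the toy run the first state is filtered out by Lemma 1, so the expensive verifier is called only for the two remaining candidates.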


To analyze the complexity of Algorithm 2, we first see by Theorem 2 that
evaluating the robustness of $A$, i.e. computing its robust accuracy and finding
its adversarial examples, requires calling Algorithm 1 for each quantum state
in the training dataset, each call costing $O(|\mathcal{C}| \cdot n^{6.5})$. Thus, the total complexity of
robustness verification is $O(|T| \cdot |\mathcal{C}| \cdot n^{6.5})$, where $|T|$ is the number of elements in
the training dataset $T$. However, the robust bound given in Lemma 1 can help
to speed up the process by quickly finding all potential non-robust states, as the
complexity of evaluating the bound is only $O(|\mathcal{C}| \cdot n^3)$, which is the cost of $|\mathcal{C}|$
multiplications of two $n \times n$ matrices. In practice, this bound scales well,
as confirmed by our experiments presented in Sect. 7. Therefore, a good strategy
for implementing the robustness verification is to first use the robust bound
to pick out all potential non-robust states from the given training dataset $T$ and
store them in a set $T'$. Then we check all remaining candidates, i.e. the states in $T'$,
one by one using Algorithm 1 and use a set $R$ to record the adversarial
examples found and the corresponding indexes of states. This strategy can significantly
reduce the complexity to $O(|T'| \cdot |\mathcal{C}| \cdot n^{6.5})$. Indeed, our experiments show that
the robust bound given in Lemma 1 scales very well in the sense that $|T'| \ll |T|$.


Remark 5. Thanks to the linearity of the quantum learning model determined by 
the basic postulate of quantum mechanics, the robustness verification of quantum 
classifiers can be done in an efficient way (with polynomial time complexity in 
the size of the input state). It is usually not the case in verifying the robustness 
of classical machine learning algorithms. For example, DNNs are often non-linear 
and non-convex, and verifying even some simple properties of them can be an 
NP-complete problem [37]. 

Surprisingly, the robustness verification problem for quantum classifiers 
becomes much harder if we are required to find adversarial examples in pure 
states. Roughly speaking, the reason is that the set of all pure states is not 
convex, and thus computing the optimal robust bound for pure states is not 
an SDP, as in Theorem 2. We can prove that it is a Quadratically Constrained 
Quadratic Program (QCQP), an optimization problem where both the objec- 
tive function and the constraints are quadratic functions (see Appendix D of 
the extended version of this paper [31] for the proof), which is NP-hard. Algo- 
rithm 1 can be adapted to this pure state robustness verification by calling a 
QCQP solver instead of an SDP solver in Line 2. Subsequently, Algorithm 2 
can use this new version of Algorithm 1 as a subroutine to compute the corre- 
sponding robust accuracy and find adversarial examples of pure states. We will 
evaluate the QCQP-based robustness verification in the case study of MNIST 
classification in which handwritten digits are encoded in pure states. 


7 Evaluation 


Algorithm 2 is implemented on TensorFlow Quantum, a platform of Google for 
designing and training quantum machine learning algorithms, and calls an SDP 
solver, CVXPY: Python Software for Disciplined Convex Programming [38]. 
This section evaluates our approach with experiments on some concrete 
examples and is arranged as follows. In Subsects. 7.1–7.4, we present 
several well-trained quantum classifiers. The evaluation is then carried out in 
Subsect. 7.5 by applying Algorithm 2 to verify the robustness of these 
classifiers and to find their adversarial examples, if any exist. 

To demonstrate our method as thoroughly as possible, we check the robustness 
of four quantum classifiers. We begin with a “Hello World” example, qubits 
classification, and then move on to two quantum classifiers applied to real-world 
tasks, quantum phase recognition and cluster excitation detection, which are 
both fundamental and hard problems in quantum physics. Finally, to compare 
with classical robustness verification, we consider the classification of MNIST 
by encoding images of handwritten digits into quantum data. These experiments 
cover all illustrated examples of TensorFlow Quantum. 


7.1 Quantum Bits Classification 


A “Hello World” example of quantum machine learning is quantum bits 
classification [7]. The aim is to implement a binary classification for regions on a 
single qubit, i.e., a perceptron for qubits. Specifically, two random normalized 
vectors $|a\rangle$ and $|b\rangle$ (pure states) in the X-Z plane of the Bloch sphere are 
chosen. Around these two vectors, we randomly sample two sets of quantum data 
points on the sphere (one for category “a” and one for category “b”), forming a 
dataset; the objective is to learn a quantum gate that distinguishes the two sets. 
A concrete instance of this type is shown in Fig. 2. In this example, the angles 
between $|0\rangle$ (the Z-axis) and the two states $|a\rangle$ and $|b\rangle$ are $\theta_a = 1$ and 
$\theta_b = 1.23$, respectively; see the first figure in Fig. 2. The dataset consists of 
800 samples for training and 200 samples for validation. As shown in Fig. 2, we 
use a parameterized rotation gate $R_y(\theta) = e^{-i\sigma_y\theta/2}$ and a measurement 
$M = \{M_a = |0\rangle\langle 0|, M_b = |1\rangle\langle 1|\}$ to do the classification. To minimize the MSE 
form of Eq. (8), we use the Adam optimizer [39] to update $\theta$. After training, we 
achieve 100% accuracy on both the training and validation sets, and the final 
parameter $\theta$ is 0.4835. 
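
The single-qubit classifier described above can be simulated directly with a 2 × 2 matrix. The sketch below is illustrative only: it assumes that a state in the X-Z plane at angle $\phi$ from $|0\rangle$ is $\cos(\phi/2)|0\rangle + \sin(\phi/2)|1\rangle$, and that a state is assigned to category “a” when the outcome associated with $|0\rangle\langle 0|$ has probability at least 0.5. The function names and this decision rule are assumptions, not taken from the paper.

```python
import numpy as np


def ry(theta: float) -> np.ndarray:
    """R_y(theta) = exp(-i * sigma_y * theta / 2), which is a real rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])


def xz_state(phi: float) -> np.ndarray:
    """Pure state in the X-Z plane at angle phi from |0> (the Z-axis)."""
    return np.array([np.cos(phi / 2), np.sin(phi / 2)])


def prob_outcome_a(phi: float, theta: float) -> float:
    """Probability of the outcome associated with |0><0| after applying R_y(theta)."""
    amplitude_0 = (ry(theta) @ xz_state(phi))[0]
    return float(abs(amplitude_0) ** 2)


def classify(phi: float, theta: float) -> str:
    """Assumed decision rule: pick the more likely measurement outcome."""
    return "a" if prob_outcome_a(phi, theta) >= 0.5 else "b"
```

Under these assumptions the probability simplifies to $\cos^2((\phi + \theta)/2)$, so with the trained value $\theta = 0.4835$ the two center states at angles 1 and 1.23 fall on opposite sides of the 0.5 threshold.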


Fig. 2. Training model of quantum bits classification: the left figure shows the samples 
of the quantum training dataset represented on the Bloch sphere. Samples are divided 
into two categories, marked by red and yellow, respectively. The vectors are the states 
around which the samples were taken. The first part of the right figure is a parame- 
terized rotation gate, whose job is to remove the super-positions in the quantum data. 
The second part is a measurement M along the Z-axis of the Bloch sphere converting 
the quantum data into classes. (Color figure online) 


7.2 Quantum Phase Recognition 


Quantum phase recognition (QPR) of one-dimensional many-body systems has 
been tackled with quantum convolutional neural networks (QCNNs) proposed by 
Cong et al. [5]. Consider a $Z_2 \times Z_2$ symmetry-protected topological (SPT) phase 
$P$ and the ground states of a family of Hamiltonians on a spin-1/2 chain with open 
boundary conditions: 

$$H = -J \sum_{i=1}^{N-2} Z_i X_{i+1} Z_{i+2} - h_1 \sum_{i=1}^{N} X_i - h_2 \sum_{i=1}^{N-1} X_i X_{i+1},$$

where $X_i$, $Z_i$ are Pauli matrices [2] for the spin at site $i$, and the $Z_2 \times Z_2$ 
symmetry is generated by $X_{even(odd)} = \prod_{i \in even(odd)} X_i$. The goal is to identify 
whether the ground state $|\psi\rangle$ of $H$ belongs to phase $P$ when $H$ is regarded as 
a function of $(h_1/J, h_2/J)$. For small $N$, a numerical simulation can be used to 
solve this problem exactly [5]; see Fig. 4a in Appendix E of the extended version 
of this paper [31] for the exact phase boundary points (blue and red diamonds) 
between the SPT phase and the non-SPT (paramagnetic or antiferromagnetic) phase 
for $N = 6$. Thus the 6-qubit instance is an excellent testbed for new methods and 
techniques of QPR. Here, we train a QCNN model to implement 6-qubit QPR 
in this setting. 
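
For small $N$ the Hamiltonian above can be assembled as a dense matrix and diagonalized numerically. The sketch below is a plain NumPy illustration of that exact simulation, not the code used in the experiments; the helper names `site_operator`, `spt_hamiltonian` and `ground_state` are hypothetical, and sites are indexed from 0.

```python
from functools import reduce

import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])


def site_operator(n: int, ops: dict) -> np.ndarray:
    """Tensor product over n sites; `ops` maps a 0-based site to a 2x2 matrix."""
    return reduce(np.kron, [ops.get(i, I2) for i in range(n)])


def spt_hamiltonian(n: int, J: float, h1: float, h2: float) -> np.ndarray:
    """H = -J sum Z_i X_{i+1} Z_{i+2} - h1 sum X_i - h2 sum X_i X_{i+1} (open chain)."""
    H = np.zeros((2 ** n, 2 ** n))
    for i in range(n - 2):
        H -= J * site_operator(n, {i: Z, i + 1: X, i + 2: Z})
    for i in range(n):
        H -= h1 * site_operator(n, {i: X})
    for i in range(n - 1):
        H -= h2 * site_operator(n, {i: X, i + 1: X})
    return H


def ground_state(H: np.ndarray):
    """Lowest eigenvalue and a corresponding normalized eigenvector."""
    energies, vectors = np.linalg.eigh(H)
    return energies[0], vectors[:, 0]
```

For $N = 6$ the matrix is only 64 × 64, so sweeping $h_1/J$ and $h_2/J$ over a grid and calling `ground_state` at each point is cheap. Note that when the lowest eigenvalue is degenerate, `eigh` returns one arbitrary vector from the ground space.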
To generate the dataset for training, we sample a series of Hamiltonians 
$H$ with $h_2/J = 0$, uniformly varying $h_1/J$ from 0 to 1.2, and compute their 
corresponding ground states; see the gray line of Fig. 4a in Appendix E of 
the extended version of this paper [31]. For testing, we uniformly sample 
a set of validation data from two random regions of the 2-dimensional space 
$(h_1/J, h_2/J)$; see the two dashed rectangles of Fig. 4a. Finally, we obtain 1000 
training data and 400 validation data. Our parameterized QCNN circuit is shown 
in Fig. 4b in Appendix E of the extended version of this paper [31], and the 
unitaries $U_i$, $V_j$, $F$ are parameterized with the generalized Gell-Mann matrix 
basis [40]: $U = \exp(-i \sum_j \theta_j \Lambda_j)$, where $\Lambda_j$ is a matrix and $\theta_j$ is a real 
number; the total number of parameters $\theta_j$ is 114. For the outcome measurement 
of one qubit, we use the measurement $M = \{M_0 = |+\rangle\langle +|, M_1 = |-\rangle\langle -|\}$ to predict 
that input states belong to $P$ with output 0, where $|\pm\rangle = \frac{1}{\sqrt{2}}(|0\rangle \pm |1\rangle)$. To 
minimize the MSE form of Eq. (8), we use the Adam optimizer to update the 
114 parameters. After training, 97.7% training accuracy and 95.25% validation 
accuracy are obtained. At the same time, our classifier produces a phase 
diagram (the colorful figure in Fig. 4a), where the learned phase boundary almost 
perfectly matches the exact one obtained by the numerical simulation. All these 
results indicate that our classifier is well-trained. 


7.3 Cluster Excitation Detection 


The task of cluster excitation detection is to train a quantum classifier to detect 
whether a prepared cluster state is “excited” or not [7]. Excitations are represented 
with an X rotation on one qubit. A large enough rotation is deemed to be an excited 
state and is labeled 0, while a rotation that is not large enough is labeled 1 
and is not deemed to be an excited state. Here, we demonstrate this classification 
task with 6 qubits. We use the circuit shown in Fig. 5a of Appendix E in the 
extended version of this paper [31] to generate training (840) and validation 
(360) samples. The circuit generates a cluster state by performing an X rotation 
(we omit the angle $\theta$) on one qubit. The rotation angle $\theta$ ranges from $-\pi$ to $\pi$; 
if $-\pi/2 < \theta < \pi/2$, the label of the output state is 1; otherwise, the label is 0. 
The classification circuit model (a quantum convolutional neural network) uses 
the same structure as in TensorFlow Quantum [7], shown in Fig. 5b of Appendix E 
in the extended version of this paper [31]. The explicit parameterization of $C_i$, $P_i$ 
can be found in [7]. The final measurement is $M = \{M_0 = |0\rangle\langle 0|, M_1 = |1\rangle\langle 1|\}$. 
To minimize the MSE form of Eq. (8), we use the Adam optimizer to 
update all $C_i$, $P_i$. We achieve 99.76% training accuracy and 99.44% validation 
accuracy. 


7.4 The Classification of MNIST 


Handwritten digit recognition is one of the most popular tasks in the classical 
machine learning zoo. The archetypical training and validation data come from 
the MNIST dataset, which consists of 55,000 training samples of handwritten 
digits [41]. These digits have been labeled by humans as representing one of the 
ten digits from 0 to 9, and are in the form of grayscale images that 
contain 28 × 28 pixels. Each pixel has a grayscale value ranging from 0 to 255. 
Quantum machine learning has been used to distinguish an oversimplified version 
of MNIST obtained by downscaling the images to 8 × 8 pixels. Consequently, the 
digits represented by this version of MNIST cannot be perceptually 
recognized [7]. Here, we build a quantum classifier to recognize a version of MNIST 
with 16 × 16 pixels (see the second-column images of Fig. 3). As demonstrated in [7], 
we select 700 images of the digit 3 and 700 images of the digit 6 to form our 
training (1000 images) and validation (400 images) datasets. Then we downscale 
those 28 × 28 images to $2^4 \times 2^4$ images (fitting the size of quantum data), and 
encode them into pure states of 8 qubits via amplitude encoding. Amplitude 
encoding uses the amplitudes of the computational basis to represent vectors with 
normalization: 

$$(x_0, x_1, \ldots, x_{n-1})^\top \mapsto \sum_{i=0}^{n-1} \frac{x_i}{\sqrt{\sum_{j=0}^{n-1} |x_j|^2}} \, |i\rangle,$$

where $n = 2^8$ and $\{|i\rangle\}$ is an orthonormal basis of the 8-qubit state space. The 
normalization does not change the pattern of those images. For learning a quantum 
classifier, we use the QCNN model in Fig. 6 of Appendix E in the extended 
version of this paper [31] and use the measurement $M = \{M_0 = |+\rangle\langle +|, M_1 = |-\rangle\langle -|\}$. 
The output of measurement $M$ indicates the digit: output 1 for the digit 3 
and output 0 for the digit 6. The explicit parameterization of those $C_i$, $P_i$ can 
be found in [7]. Again we use the Adam optimizer to update the model parameters, 
minimizing the MSE form of Eq. (8). We finally achieve 98.4% training accuracy 
and 97.5% validation accuracy. 
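
The amplitude encoding step used above amounts to flattening the image and dividing by its Euclidean norm. The following is a minimal NumPy sketch under that reading; the function name `amplitude_encode` and the row-major pixel ordering are assumptions, and the downscaling from 28 × 28 to 16 × 16 is left out.

```python
import numpy as np


def amplitude_encode(image: np.ndarray) -> np.ndarray:
    """Map pixel values to the amplitudes of a normalized state vector."""
    x = np.asarray(image, dtype=float).reshape(-1)  # row-major flattening (assumed)
    n_qubits = int(np.log2(x.size))
    if 2 ** n_qubits != x.size:
        raise ValueError("the number of pixels must be a power of two")
    norm = np.linalg.norm(x)
    if norm == 0:
        raise ValueError("an all-zero image cannot be normalized")
    return x / norm  # amplitude of |i> is x_i / sqrt(sum_j |x_j|^2)
```

A 16 × 16 image yields a vector of length 256, i.e. a pure state of 8 qubits. Because every pixel is divided by the same constant, ratios between pixel intensities are preserved, which is the sense in which normalization leaves the pattern of the image unchanged.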


7.5 Robustness Verification 


Now, we start to check the $\varepsilon$-robustness of the four well-trained classifiers 
presented in the previous four subsections. 

In practical applications, the value of $\varepsilon$ in Definition 3 represents 
the ability of state preparation by quantum controls. For example, the state of 
the art is that a single qubit can be prepared with fidelity 99.99% (e.g., [29,30]). 
Here, we choose four different values of $\varepsilon$ in each experiment. 

To show the scalability of our robust bound given in Lemma 1, we use it to 
develop an algorithm (Algorithm 3 in Appendix F of the extended version of 
this paper [31]) that under-approximates the robust accuracy computed 
by Algorithm 2. Algorithm 3 is the part of Algorithm 2 that does not call an 
SDP solver: every potential non-robust state detected by the robust bound in 
Lemma 1 is counted as non-robust. We compare the verification times of 
Algorithms 2 and 3. 


Table 1. Verification results of quantum bits classification 

Robust accuracy (in percent) 
Method | $\varepsilon = 0.001$ | $\varepsilon = 0.002$ | $\varepsilon = 0.003$ | $\varepsilon = 0.004$ 
Robust bound (Lemma 1 - Algorithm 3) | 88.13 | 75.88 | 58.88 | 38.25 
Robustness algorithm (Theorem 2 - Algorithm 2) | 90.00 | 76.50 | 59.75 | 38.88 

Verification times (in seconds) 
Method | $\varepsilon = 0.001$ | $\varepsilon = 0.002$ | $\varepsilon = 0.003$ | $\varepsilon = 0.004$ 
Robust bound (Lemma 1 - Algorithm 3) | 0.0050 | 0.0048 | 0.0047 | 0.0048 
Robustness algorithm (Theorem 2 - Algorithm 2) | 1.3260 | 2.7071 | 4.6285 | 6.9095 


Table 2. Verification results of quantum phase recognition 

Robust accuracy (in percent) 
Method | $\varepsilon = 0.0001$ | $\varepsilon = 0.0002$ | $\varepsilon = 0.0003$ | $\varepsilon = 0.0004$ 
Robust bound (Lemma 1 - Algorithm 3) | 99.20 | 98.80 | 98.60 | 98.30 
Robustness algorithm (Theorem 2 - Algorithm 2) | 99.20 | 98.80 | 98.60 | 98.40 

Verification times (in seconds) 
Method | $\varepsilon = 0.0001$ | $\varepsilon = 0.0002$ | $\varepsilon = 0.0003$ | $\varepsilon = 0.0004$ 
Robust bound (Lemma 1 - Algorithm 3) | 1.4892 | 1.4850 | 1.4644 | 1.4789 
Robustness algorithm (Theorem 2 - Algorithm 2) | 19.531 | 25.648 | 28.738 | 33.537 


Table 3. Verification results of cluster excitation detection 

Robust accuracy (in percent) 
Method | $\varepsilon = 0.0001$ | $\varepsilon = 0.0002$ | $\varepsilon = 0.0003$ | $\varepsilon = 0.0004$ 
Robust bound (Lemma 1 - Algorithm 3) | 99.05 | 98.81 | 98.21 | 97.86 
Robustness algorithm (Theorem 2 - Algorithm 2) | 100.0 | 100.0 | 100.0 | 100.0 

Verification times (in seconds) 
Method | $\varepsilon = 0.0001$ | $\varepsilon = 0.0002$ | $\varepsilon = 0.0003$ | $\varepsilon = 0.0004$ 
Robust bound (Lemma 1 - Algorithm 3) | 1.2899 | 1.2794 | 1.2544 | 1.2567 
Robustness algorithm (Theorem 2 - Algorithm 2) | 209.52 | 244.79 | 325.97 | 365.30 


Table 4. Verification results of the classification of MNIST 

Robust accuracy (in percent) 
Method | $\varepsilon = 0.0001$ | $\varepsilon = 0.0002$ | $\varepsilon = 0.0003$ | $\varepsilon = 0.0004$ 
Robust bound (Lemma 1 - Algorithm 3) | 99.70 | 99.40 | 99.30 | 99.20 
Robustness algorithm (Theorem 2 - Algorithm 2) | 99.80 | 99.60 | 99.30 | 99.30 

Verification times (in seconds) 
Method | $\varepsilon = 0.0001$ | $\varepsilon = 0.0002$ | $\varepsilon = 0.0003$ | $\varepsilon = 0.0004$ 
Robust bound (Lemma 1 - Algorithm 3) | 0.0803 | 0.1315 | 0.0775 | 0.0811 
Robustness algorithm (Theorem 2 - Algorithm 2) | 0.3955 | 0.6751 | 0.7653 | 0.8855 


The experiments are done on a computer with the following configuration: 
Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz x 8 Processor, 15.8 GiB Memory, 
Ubuntu 18.04.5 LTS, with CVXPY: Python Software for Disciplined Convex 
Programming [38] for solving SDP, and a SciPy solver for constrained nonlinear 
multivariable minimization for solving QCQP. 

The experimental results are given in Tables 1, 2, 3 and 4. As an example, 
we illustrate the details of the result for the case of $\varepsilon = 0.001$ in Table 1. First, 
we apply only our robust bound in Lemma 1 to pick out all potential non-robust 
states from the 800 points in the training dataset. This leaves 95 points. Thus, 
the under-approximation of the robust accuracy computed by Algorithm 3 (in 
Appendix F of the extended version of this paper [31]) is 88.13%. Next, we check 
the 0.001-robustness by Algorithm 2. Indeed, only 80 of the points detected by 
the above robust bound are non-robust, and the exact robust accuracy is 90.00%. 
We also compare the verification times of the two approaches to the robust 
accuracy. See the second column of Table 1 for the details; the other experimental 
results on $\varepsilon$-robustness are summarized in the same table. Tables 1, 2, 3 and 
4 show that in all of these experiments the robust bound obtained in Lemma 1 
scales very well, and that robustness verification by Algorithm 3 costs 
significantly less time (<2 s) than computing the optimal robust bound by 
Algorithm 2. For example, for quantum phase recognition with $\varepsilon = 0.0001$, 0.0002 
and 0.0003, the under-approximation of the robust accuracy is the same as the 
exact value. Even in the last case of $\varepsilon = 0.0004$, the difference is only 0.1%. 
Furthermore, the tables show that the verification time of Algorithm 2 increases 
with the value of $\varepsilon$, while the running time of the method based on the robust 
bound is almost unchanged. This is because the former algorithm uses an SDP or 
QCQP solver to search for adversarial examples among the potential non-robust 
states picked out by the robust bound, and the number of these states grows with 
the value of $\varepsilon$. The counter-examples detected by the algorithm confirm that our 
robustness framework is effective. For instance, see Fig. 3 for two visualized 
adversarial examples generated by Algorithm 2 with a QCQP solver. As we can see, 
the benign and adversarial images are perceptually similar. This also shows that 
our robustness verification algorithm can detect not only quantum but also 
classical adversarial examples. 
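
As a quick arithmetic check of the $\varepsilon = 0.001$ case in Table 1, the two robust accuracy figures follow from the counts quoted in the text (95 candidate states, 80 confirmed non-robust states, 800 training states), assuming that robust accuracy is one minus the fraction of non-robust states:

```latex
\begin{align*}
\text{under-approximation (Algorithm 3):}\quad & 1 - \frac{95}{800} = 0.88125 \approx 88.13\%,\\
\text{exact value (Algorithm 2):}\quad & 1 - \frac{80}{800} = 0.9000 = 90.00\%.
\end{align*}
```

The difference of 15 states corresponds to candidates flagged by the bound for which the exact check finds no adversarial example.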


[Figure 3: row 1, benign label 3 and adversarial label 6; row 2, benign label 6 and adversarial label 3] 


Fig. 3. Two training states and their adversarial examples generated by Algorithm 2 
with a QCQP solver: the first-column images are 28 × 28 benign data from MNIST; the 
second column shows the two downscaled 16 × 16 grayscale images; the third-column 
images are the grayscale differences between benign and adversarial images; the 
last-column images are decoded from adversarial examples found by Algorithm 2. 


8 Conclusion 


In this work, we initiate the research of the formal robustness verification of 
quantum machine learning algorithms against unknown quantum noise. We 
found an analytical robustness bound which can be efficiently computed to 
under-approximate the robust accuracy in practical applications. Furthermore, 
we developed a robustness verification algorithm that can exactly verify the 
$\varepsilon$-robustness of quantum machine learning algorithms and provides useful 
counter-examples for adversarial training. 

As a topic for future research, it would be useful in practical applications to 
find an efficient method that over-approximates the robust accuracy of quantum 
classifiers. Combined with the under-approximation approach developed in this 
work, it could help us estimate the robust accuracy more accurately and more 
quickly. In classical machine learning, some works in the literature achieve this 
task. For instance, ImageStars, a new set representation, was introduced in [13] to 
perform efficient set-based analysis by combining operations on concrete images 
with linear programming, which leads to efficient over-approximative analysis of 
classical convolutional neural networks. 

Tensor networks are one of the best-known data structures for implementing 
large-scale quantum classifiers (e.g., QCNNs with 45 qubits in [5]). For practical 
applications, we are going to incorporate tensor networks into our robustness 
verification algorithm so that it can scale up to meet the demands of NISQ 
devices (of >50 qubits). 

More generally, further investigations are required to better understand the 
role of robustness in quantum machine learning, especially through more 
experiments on real-world applications like learning phases of quantum many-body 
systems. 


Abstract. Verifying and explaining the behavior of neural networks is becoming 
increasingly important, especially when they are deployed in safety-critical 
applications. In this paper, we study verification and interpretability problems 
for Binarized Neural Networks (BNNs), the 1-bit quantization of general 
real-numbered neural networks. Our approach is to encode BNNs into Binary Decision 
Diagrams (BDDs), which is done by exploiting the internal structure of the 
BNNs. In particular, we translate the input-output relation of blocks in BNNs to 
cardinality constraints which are in turn encoded by BDDs. Based on the encoding, 
we develop a quantitative framework for BNNs where precise and comprehensive 
analysis of BNNs can be performed. We demonstrate the application of 
our framework by providing quantitative robustness analysis and interpretability 
for BNNs. We implement a prototype tool BDD4BNN and carry out extensive 
experiments, confirming the effectiveness and efficiency of our approach. 


1 Introduction 


Deep neural networks (DNNs) have achieved human-level performance in several 
tasks, and are increasingly being incorporated into various application domains such as 
autonomous driving [4] and medical diagnostics [53]. Modern DNNs usually contain 
a great many parameters which are typically stored as 32/64-bit floating-point 
numbers, and require a massive amount of floating-point operations to compute the output 
for a single input [60]. As a result, it is often challenging to deploy them on 
resource-constrained, embedded devices. To mitigate the issue, quantization, which quantizes 
32/64-bit floating-points to low bit-width fixed-points (e.g., 4-bits) with little accuracy 
loss [23], emerges as a promising technique to reduce resource requirements. In 
particular, binarized neural networks (BNNs) [27] represent the case of 1-bit quantization 
using the bipolar binaries ±1. BNNs can drastically reduce memory storage and 
execution time with bit-wise operations, and hence substantially improve time and energy 
efficiency. BNNs have been demonstrated to achieve high accuracy for a wide variety 
of applications [34,41,52]. 

DNNs have been shown to lack robustness [11,14,36,49,59] and interpretability 
of the predictions they make [25,43]. Various formal techniques and heuristics have 
been proposed to analyze DNNs and interpret their behaviors, most of which focus on 
real-numbered DNNs only. Verification of quantized DNNs has not been thoroughly 
explored so far, although recent results have highlighted its importance: it was shown 
that a quantized DNN does not necessarily preserve the properties satisfied by the 
real-numbered DNN before quantization [14,22]. Indeed, the fixed-point number semantics 
effectively yields a discrete state space for the verification of quantized DNNs whereas 
real-numbered DNNs feature a continuous state space. The discrepancy could 
invalidate current verification techniques for real-numbered DNNs when they are directly 
applied to the quantized counterparts (e.g., both false negatives and false positives could 
occur). Therefore, specialized techniques are required for rigorously verifying 
quantized DNNs. 

Broadly speaking, the existing techniques for quantized DNNs make use of con- 
straint solving which is based on either SAT/SMT or (reduced, ordered) binary decision 
diagrams (BDDs). A majority of work resorts to SAT/SMT solving. For the 1-bit quan- 
tization (i.e., BNNs), typically BNNs are transformed into Boolean formulas where 
SAT solving is harnessed [12,33,45,46]. Some recent work also studies variants of 
BNNs [28,48], i.e., BNNs with ternary weights. For quantized DNNs with multiple bits 
(i.e., fixed-points), it is natural to encode them as quantifier-free SMT formulas, e.g., 
using bit-vector and fixed-point theories [7,22,24], so that off-the-shelf SMT solvers 
can be leveraged. In another direction, BDD-based approaches currently can tackle 
BNNs only [54]. In a nutshell, they encode a BNN and an input region as a BDD, 
based on which various analyses can be performed via queries on the BDD. The crux 
of the approach is how to generate the BDD efficiently. In the work [54], the BDD is 
constructed by BDD learning [44] and is thus currently limited to toy BNNs (e.g., 64 input 
size, 5 hidden neurons, and 2 output size) with relatively small input regions. 

On the other hand, existing work mostly focuses on qualitative verification, which 
asks whether there exists an input x (in a specified region) for a neural network such that 
a property (e.g., local robustness) is violated. In many practical applications, checking 
only the existence is not sufficient. Indeed, for local robustness, such an (adversarial) 
input almost surely exists, which makes a qualitative answer less meaningful. Instead, 
quantitative verification, which asks how often a property $\phi$ is satisfied or violated, is 
far more useful yet more challenging, as it could provide a probabilistic guarantee of 
the behavior of neural networks. Such a quantitative guarantee is essential to certify, 
for instance, certain implementations of neural network based perceptual components 
against safety standards of autonomous vehicles [29,32]. Quantitative analysis of 
general neural networks, however, is challenging; it has hence received little attention, 
and the results are rather limited so far. DeepSRGR [69] presented an abstract 
interpretation based quantitative robustness verification approach for DNNs which is sound 
but incomplete. For BNNs, approximate SAT model-counting solvers (#SAT) are 
leveraged [6,47] based on the SAT encoding for the qualitative counterpart. Though 
probably approximately correct (PAC) style guarantees can be provided, verification cost is 
usually prohibitively high to achieve higher precision and confidence. 

Main Contributions. We propose a BDD-based framework BDD4BNN to support 
quantitative analysis of BNNs. The main challenge is how to efficiently build BDDs 
from BNNs [47]. In contrast to previous work [54] which is learning-based and largely 
treats the BNN as a blackbox, we directly encode a BNN and the associated input 
region into BDDs. In a nutshell, a BNN is a sequential composition of multiple internal 
blocks and one output block. Each block comprises 3 layers and captures a function 
$f : \{+1,-1\}^n \to \{+1,-1\}^m$, where $n$ (resp. $m$) denotes the number of inputs (resp. 
outputs) of the block. Technically, the function $f$ can be alternatively rewritten as a function 
over the standard Boolean domain, i.e., $f : \{0,1\}^n \to \{0,1\}^m$. A key stepping-stone of 
our encoding is the observation that the $i$-th output $y_i$ of the block can be captured by 
a cardinality constraint of the form $\sum_{j=1}^{n} \ell_j \geq k$ such that $y_i = +1 \Leftrightarrow \sum_{j=1}^{n} \ell_j \geq k$, 
where each literal $\ell_j$ is either $x_j$ or $\neg x_j$ for the input variable $x_j$, and $k$ is a constant. 
We then present an algorithm to encode a cardinality constraint $\sum_{j=1}^{n} \ell_j \geq k$ as a BDD 
with $O((n - k) \cdot k)$ nodes in $O((n - k) \cdot k)$ time. As a result, the input-output relation 
of each block can be encoded as a BDD, the composition of which yields the BDD for 
the entire BNN. A distinguished advantage of our BDD encoding lies in its support of 
incremental encoding. In particular, when different input regions are of interest, there is 
no need to construct the BDD of the entire BNN from scratch. 
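
To make the cardinality-constraint encoding more concrete, the sketch below builds a decision diagram for $\sum_j \ell_j \geq k$ by memoizing on the pair (next variable, number of literals already satisfied). It is an illustrative construction written for this note, not the algorithm from the paper; the names `cardinality_bdd` and `evaluate` and the node representation are assumptions.

```python
def cardinality_bdd(negated, k):
    """Decision diagram for sum_j l_j >= k, where l_j is x_j or (not x_j).

    negated[j] is True when l_j = not x_j. Returns (root, nodes): `nodes` maps
    a node id to (variable index, lo child, hi child); ids 0 and 1 are the
    0-leaf and the 1-leaf.
    """
    n = len(negated)
    nodes = {}
    cache = {}  # (variable index, satisfied literals so far) -> node id

    def build(i, count):
        if count >= k:            # threshold already reached
            return 1
        if count + (n - i) < k:   # threshold can no longer be reached
            return 0
        key = (i, count)
        if key not in cache:
            when_true = build(i + 1, count + 1)   # literal l_i holds
            when_false = build(i + 1, count)      # literal l_i fails
            # For a negated literal, x_i = 1 makes the literal false.
            lo, hi = (when_true, when_false) if negated[i] else (when_false, when_true)
            node_id = len(nodes) + 2
            nodes[node_id] = (i, lo, hi)
            cache[key] = node_id
        return cache[key]

    return build(0, 0), nodes


def evaluate(root, nodes, x):
    """Follow hi/lo edges according to the 0/1 assignment x."""
    node = root
    while node > 1:
        var, lo, hi = nodes[node]
        node = hi if x[var] else lo
    return node == 1
```

Every internal node corresponds to a pair (i, count) with count < k and at most n - k failed literals so far, so this construction creates at most k · (n - k + 1) internal nodes, which is in line with the $O((n - k) \cdot k)$ bound stated above.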

Encoding BNNs as BDDs enables a wide variety of applications in security analysis 
and decision explanation of BNNs. In this paper, we highlight two of them within our 
framework, i.e., robustness analysis and interpretability. DNNs have been shown to 
suffer from poor robustness to adversarial examples [49,50,59]. We consider 
two quantitative variants of the problem: (1) how many adversarial examples does the 
BNN have in the input region, and (2) how many of them are misclassified to each 
class? We further provide an algorithm to incrementally compute the (locally) maximal 
Hamming distance within which the BNN satisfies the desired robustness properties. 
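
For intuition, the two quantitative questions above can be answered by brute-force enumeration when the input region is a small Hamming ball. The sketch below does exactly that for an arbitrary classifier over ±1 vectors; it is a naive baseline for illustration, not the BDD-based method of this paper, and the names `hamming_ball` and `count_adversarial` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming_ball(x, radius):
    """Yield every +1/-1 vector within Hamming distance `radius` of x."""
    n = len(x)
    for r in range(radius + 1):
        for flips in combinations(range(n), r):
            y = list(x)
            for j in flips:
                y[j] = -y[j]
            yield tuple(y)


def count_adversarial(classify, x, radius):
    """Return (number of adversarial inputs, number of inputs per predicted class)."""
    target = classify(tuple(x))
    per_class = Counter(classify(y) for y in hamming_ball(x, radius))
    adversarial = sum(c for label, c in per_class.items() if label != target)
    return adversarial, per_class
```

The number of points in the ball grows combinatorially with the radius and the input size, which is why an enumeration like this is only feasible for toy instances.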

Interpretability is an issue that arises from the blackbox nature of DNNs [25,43]. 
In application domains such as medical diagnosis, understanding the decisions made by 
DNNs is a must. We consider two problems: (1) why are some inputs (mis)classified 
into a class by the BNN, and (2) are there any essential features in the input region that 
are common to all samples classified into a class? 


Experimental Results. We implement our framework as a prototype tool BDD4BNN 
using the CUDD package [58], which scales to BNNs with up to 4 internal blocks, 
200 hidden neurons, and 784 input size. To the best of our knowledge, it is the first 
work to precisely and quantitatively analyze such large BNNs, going significantly 
beyond the state of the art. The experimental results show that BDD4BNN is 
significantly more efficient and scalable than the learning-based technique [54]. Furthermore, 
we demonstrate how BDD4BNN can be used in quantitative robustness analysis and 
decision explanation of BNNs. For quantitative robustness analysis, our experimental 
results show that BDD4BNN is considerably (5x to 1,340x) faster and more accurate 
than the state-of-the-art approximate #SAT-based approach [6]. It can also compute 
precisely the distribution of predicted classes of the images in the input region as well as 
the locally maximal Hamming distances on several BNNs. For decision explanation, 
we show the effectiveness of BDD4BNN in computing prime-implicant explanations 
and essential features of the given input region for some target classes. Note that this 
work focuses on quantitative verification and interpretability of BNNs and may 
underperform SAT/SMT-based methods [12,33,45,46] for qualitative verification of BNNs. 

In general, our main contributions can be summarized as follows. 


Fig. 1. Architecture of a BNN with d + 1 blocks 


— We introduce a novel algorithmic approach for encoding BNNs into BDDs that 
exactly preserves the semantics of BNNs and supports incremental encoding. 

— We propose a framework for quantitative verification of BNNs and in particular, we 
demonstrate the robustness analysis and interpretability of BNNs. 

— We implement the framework as an end-to-end tool BDD4BNN and conduct thor- 
ough experiments on various BNNs, demonstrating the efficiency and effectiveness 
of BDD4BNN. 


2 Preliminaries 


In this section, we briefly introduce binarized neural networks (BNNs) and (reduced, 
ordered) binary decision diagrams (BDDs). 

We denote by $\mathbb{R}$, $\mathbb{N}$, $\mathbb{B}$, and $\mathbb{B}_{\pm 1}$ the set of real numbers, the set of natural numbers, 
the standard Boolean domain $\{0, 1\}$ and the integer set $\{+1, -1\}$, respectively. For 
$n \in \mathbb{N}$, we denote by $[n]$ the set $\{1, \ldots, n\}$. We will use $\mathbf{W}, \mathbf{W}', \ldots$ to denote 
(2-dimensional) matrices, $\mathbf{x}, \mathbf{v}, \ldots$ to denote (row) vectors, and $x, v, \ldots$ to denote scalars. 
We denote by $\mathbf{W}_{i,:}$ and $\mathbf{W}_{:,j}$ the $i$-th row and $j$-th column of the matrix $\mathbf{W}$. Similarly, 
we denote by $\mathbf{x}_j$ and $\mathbf{W}_{i,j}$ the $j$-th entry of $\mathbf{x}$ and $\mathbf{W}_{i,:}$ respectively. In this work, 
Boolean values 1/0 will be used as integers 1/0 in arithmetic computations without 
typecasting. 


2.1 Binarized Neural Networks 


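A binarized neural network (BNN) [27] is a neural network where weights and 
activations are predominantly binarized over the domain $\mathbb{B}_{\pm 1}$. In this work, we consider 
feed-forward BNNs. As shown in Fig. 1, a BNN can be seen as a sequential 
composition of several internal blocks and one output block. Each internal block comprises 3 
layers: a linear layer (LIN), a batch normalization layer (BN), and a binarization layer 
(BIN). The output block comprises a linear layer and an ARGMAX layer. Note that 
the input/output of internal blocks and the input of the output block are all vectors over 
$\mathbb{B}_{\pm 1}$. 

Table 1. Definitions of layers in BNNs, where $n_{d+2} = s$ and $\arg\max(\cdot)$ returns the index of the 
largest entry which occurs first. 

Layer | Function | Parameters | Definition 
LIN | $t_i^{lin} : \mathbb{B}_{\pm 1}^{n_i} \to \mathbb{R}^{n_{i+1}}$ | Weight matrix: $\mathbf{W} \in \mathbb{B}_{\pm 1}^{n_i \times n_{i+1}}$; Bias (row) vector: $\mathbf{b} \in \mathbb{R}^{n_{i+1}}$ | $t_i^{lin}(\mathbf{x}) = \mathbf{y}$ where $\forall j \in [n_{i+1}]$, $\mathbf{y}_j = \langle \mathbf{x}, \mathbf{W}_{:,j} \rangle + \mathbf{b}_j$ 
BN | $t_i^{bn} : \mathbb{R}^{n_{i+1}} \to \mathbb{R}^{n_{i+1}}$ | Weight vector: $\alpha \in \mathbb{R}^{n_{i+1}}$; Bias vector: $\gamma \in \mathbb{R}^{n_{i+1}}$; Mean vector: $\mu \in \mathbb{R}^{n_{i+1}}$; Std. dev. vector: $\sigma \in \mathbb{R}^{n_{i+1}}$ | $t_i^{bn}(\mathbf{x}) = \mathbf{y}$ where $\forall j \in [n_{i+1}]$, $\mathbf{y}_j = \alpha_j \cdot \frac{\mathbf{x}_j - \mu_j}{\sigma_j} + \gamma_j$ 
BIN | $t_i^{bin} : \mathbb{R}^{n_{i+1}} \to \mathbb{B}_{\pm 1}^{n_{i+1}}$ | - | $t_i^{bin}(\mathbf{x}) = \mathbf{y}$ where $\forall j \in [n_{i+1}]$, $\mathbf{y}_j = +1$ if $\mathbf{x}_j \geq 0$; $-1$ otherwise 
ARGMAX | $t_{d+1}^{am} : \mathbb{R}^{s} \to \mathbb{B}^{s}$ | - | $t_{d+1}^{am}(\mathbf{x}) = \mathbf{y}$ where $\forall j \in [s]$, $\mathbf{y}_j = 1 \Leftrightarrow j = \arg\max(\mathbf{x})$ 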


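Definition 1. A BNN $N : \mathbb{B}_{\pm 1}^{n_1} \to \mathbb{B}^{s}$ with $s$ classes is given by a tuple of blocks 
$(t_1, \ldots, t_d, t_{d+1})$ such that $N = t_{d+1} \circ t_d \circ \cdots \circ t_1$, 

— for every $i \in [d]$, $t_i : \mathbb{B}_{\pm 1}^{n_i} \to \mathbb{B}_{\pm 1}^{n_{i+1}}$ is an internal block comprising a LIN layer $t_i^{lin}$, a 
BN layer $t_i^{bn}$ and a BIN layer $t_i^{bin}$ with $t_i = t_i^{bin} \circ t_i^{bn} \circ t_i^{lin}$, 
— $t_{d+1} : \mathbb{B}_{\pm 1}^{n_{d+1}} \to \mathbb{B}^{s}$ is the output block comprising a LIN layer $t_{d+1}^{lin}$ and an ARGMAX 
layer $t_{d+1}^{am}$ with $t_{d+1} = t_{d+1}^{am} \circ t_{d+1}^{lin}$, 

where $t_i^{lin}$, $t_i^{bn}$, $t_i^{bin}$ for $i \in [d]$, $t_{d+1}^{lin}$ and $t_{d+1}^{am}$ are given in Table 1. 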


Intuitively, a LIN layer is a linear transformation. A BN layer following a LIN layer 
is used to standardize and normalize the output of the LIN layer. A BIN layer is used 
to binarize the real-numbered output vector of the BN layer. In this work, we consider 
the sign function which is widely used in BNNs to binarize real-numbered vectors. An 
ARGMAX layer follows a LIN layer and outputs the index of the largest entry as the 
predicted class which is represented by a one-hot vector. (In case there is more than one 
such entry, the first one is returned.) Formally, given a BNN $N = (t_1, \cdots, t_d, t_{d+1})$ and
an input $x \in \mathbb{B}_{\pm 1}^{n_1}$, $N(x) \in \mathbb{B}^{s}$ is a one-hot vector in which the index of the non-zero
entry is the predicted class.
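
To make the layer definitions concrete, the following is a minimal pure-Python sketch of the forward pass $N = t_{d+1} \circ t_d \circ \cdots \circ t_1$ built from the four layer types. The dictionary keys and the toy parameters are hypothetical and chosen only for illustration; they are not taken from BDD4BNN.

```python
def lin(x, W, b):
    # LIN: y_j = <x, W[:, j]> + b_j, where W is an n_i x n_{i+1} matrix over {+1, -1}
    return [sum(x[k] * W[k][j] for k in range(len(x))) + b[j] for j in range(len(b))]

def bn(x, alpha, gamma, mu, sigma):
    # BN: y_j = alpha_j * (x_j - mu_j) / sigma_j + gamma_j
    return [alpha[j] * (x[j] - mu[j]) / sigma[j] + gamma[j] for j in range(len(x))]

def bin_sign(x):
    # BIN: sign function, mapping 0 to +1
    return [1 if v >= 0 else -1 for v in x]

def argmax_onehot(x):
    # ARGMAX: one-hot vector; list.index returns the first maximal entry
    first = x.index(max(x))
    return [1 if j == first else 0 for j in range(len(x))]

def bnn_forward(x, internal_blocks, output_block):
    # Apply every internal block (LIN, BN, BIN), then the output block (LIN, ARGMAX)
    for blk in internal_blocks:
        y = lin(x, blk["W"], blk["b"])
        y = bn(y, blk["alpha"], blk["gamma"], blk["mu"], blk["sigma"])
        x = bin_sign(y)
    return argmax_onehot(lin(x, output_block["W"], output_block["b"]))

# Hypothetical toy network: 2 inputs, 2 hidden neurons, 2 classes
toy_internal = [{"W": [[1, -1], [1, 1]], "b": [0, 0],
                 "alpha": [1, 1], "gamma": [0, 0], "mu": [0, 0], "sigma": [1, 1]}]
toy_output = {"W": [[1, -1], [-1, 1]], "b": [0, 0]}
prediction = bnn_forward([1, -1], toy_internal, toy_output)
```

The one-hot result marks the predicted class; ties in the output layer are resolved in favour of the smaller index, as stated for the ARGMAX layer.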


2.2 Binary Decision Diagrams 


A BDD [9] is a rooted acyclic directed graph where non-terminal nodes v are labeled by
Boolean variables var(v) and terminal nodes (leaves) v are labeled with values $val(v) \in \mathbb{B}$,
referred to as the 1-leaf and the 0-leaf respectively. Each non-terminal node v has two
outgoing edges: hi(v) meaning var(v) = 1 and lo(v) meaning var(v) = 0. We will also
refer to hi(v) and lo(v) as the hi and lo children of v respectively. Moreover, assuming
that $x_1, \cdots, x_m$ is the variable ordering, for each node v with $var(v) = x_i$ and each
$v' \in \{hi(v), lo(v)\}$ with $var(v') = x_j$, we have $i < j$. In the graphical representation
of BDDs, hi(v) and lo(v) are depicted by solid and dashed lines respectively. Multi-Terminal
Binary Decision Diagrams (MTBDDs) are a variant of BDDs in which the
terminal nodes are not restricted to be 0 or 1. A BDD is reduced if it (1) has only one
1-leaf and one 0-leaf, (2) does not contain a node v such that hi(v) = lo(v), and (3)
does not contain two distinct non-terminal nodes v and v' such that var(v) = var(v'),
hi(v) = hi(v') and lo(v) = lo(v'). For example, Fig. 2 shows the reduced BDD for the
Boolean function $f(x_1, y_1, x_2, y_2) = (x_1 \Leftrightarrow y_1) \wedge (x_2 \Leftrightarrow y_2)$. Hereafter, we assume that
BDDs are reduced.

Table 2. Some basic BDD operations, where $op \in \{And, Or, Xor, Xnor\}$

Operation | Description
v = Var(x) | $f_v = x$
v = Const(1) | $f_v = 1$
v = Const(0) | $f_v = 0$
Not(v) | $\neg f_v$
Apply(v, v', op) | $f_v \; op \; f_{v'}$
Exists(v, X) | $\exists X. f_v$
SatAll(v) | SatAll($f_v$)
RelProd(v, v') | $f_v \circ f_{v'}$
ITE(x, v, v') | $(x \wedge v) \vee (\neg x \wedge v')$

Fig. 2. The reduced BDD for $f(x_1, y_1, x_2, y_2) = (x_1 \Leftrightarrow y_1) \wedge (x_2 \Leftrightarrow y_2)$

Bryant [9] showed that BDDs can serve as a canonical form of Boolean functions.
Given a BDD over variables $x_1, \cdots, x_m$, each non-terminal node v with $var(v) = x_i$
represents a Boolean function $f_v = (x_i \wedge f_{hi(v)}) \vee (\neg x_i \wedge f_{lo(v)})$. Operations on Boolean
functions can usually be efficiently implemented via manipulating their BDD representations.
A good variable ordering is crucial for the performance of BDD manipulations,
while the problem of finding an optimal ordering for a function is NP-hard. To store and
manipulate BDDs efficiently, the nodes are stored in a hash table and recently computed
results are stored in a cache to avoid duplicated computations. In this work, we
will use some basic BDD operations such as ITE for If-Then-Else, Xor for exclusive-OR,
Xnor for exclusive-NOR (i.e., $a \; Xnor \; b = \neg(a \; Xor \; b)$) and SatAll($f_v$) for the
set of all solutions of the Boolean formula $f_v$. We denote by $\mathcal{L}(v)$ the set SatAll($f_v$).
For easy reference, more operations are given in Table 2. By op(v, v') we denote the
operation Apply(v, v', op).
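
The storage scheme just described (a hash table of nodes plus a cache of computed results) can be sketched in a few lines of Python. This is an illustrative toy manager, not CUDD and not the BDD4BNN implementation; the class name `BDD` and its methods `mk`, `apply`, `evaluate`, and `size` are hypothetical. It builds the function of Fig. 2 under the ordering $x_1 < y_1 < x_2 < y_2$.

```python
import operator
from itertools import product

class BDD:
    """Toy reduced ordered BDD manager. Variables are integers (smaller = nearer the root);
    node ids 0 and 1 are the 0-leaf and the 1-leaf."""

    def __init__(self):
        self.unique = {}  # (var, hi, lo) -> node id: the hash table of nodes
        self.nodes = {}   # node id -> (var, hi, lo)
        self.cache = {}   # (op, u, v) -> result: avoids duplicated computations

    def mk(self, var, hi, lo):
        if hi == lo:                # reduction rule (2): no node with hi(v) = lo(v)
            return hi
        key = (var, hi, lo)
        if key not in self.unique:  # reduction rule (3): equal nodes are shared
            node_id = len(self.nodes) + 2
            self.unique[key] = node_id
            self.nodes[node_id] = key
        return self.unique[key]

    def var(self, x):
        return self.mk(x, 1, 0)

    def _top(self, u):
        return self.nodes[u][0] if u > 1 else float("inf")

    def _cofactors(self, u, x):
        if u > 1 and self.nodes[u][0] == x:
            _, hi, lo = self.nodes[u]
            return hi, lo
        return u, u

    def apply(self, op, u, v):
        # Shannon expansion on the smallest top variable of u and v
        if u <= 1 and v <= 1:
            return int(op(bool(u), bool(v)))
        key = (op, u, v)
        if key not in self.cache:
            x = min(self._top(u), self._top(v))
            u_hi, u_lo = self._cofactors(u, x)
            v_hi, v_lo = self._cofactors(v, x)
            self.cache[key] = self.mk(x, self.apply(op, u_hi, v_hi),
                                      self.apply(op, u_lo, v_lo))
        return self.cache[key]

    def evaluate(self, u, assignment):
        while u > 1:
            var, hi, lo = self.nodes[u]
            u = hi if assignment[var] else lo
        return bool(u)

    def size(self, u, seen=None):
        # Number of non-terminal nodes reachable from u
        seen = set() if seen is None else seen
        if u <= 1 or u in seen:
            return 0
        seen.add(u)
        _, hi, lo = self.nodes[u]
        return 1 + self.size(hi, seen) + self.size(lo, seen)

def xnor(a, b):
    return a == b

bdd = BDD()
x1, y1, x2, y2 = (bdd.var(i) for i in (1, 2, 3, 4))  # ordering x1 < y1 < x2 < y2
f = bdd.apply(operator.and_, bdd.apply(xnor, x1, y1), bdd.apply(xnor, x2, y2))
# Canonicity: building the same function in a different order yields the same node
g = bdd.apply(operator.and_, bdd.apply(xnor, x2, y2), bdd.apply(xnor, x1, y1))
agrees = all(bdd.evaluate(f, {1: a, 2: b, 3: c, 4: d}) == ((a == b) and (c == d))
             for a, b, c, d in product([0, 1], repeat=4))
```

Because every node passes through `mk`, two BDDs represent the same Boolean function exactly when their root ids coincide, which is the canonicity property attributed to Bryant above.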


3 BDD4BNN Design 


3.1 BDD4BNN Overview 


An overview of BDD4BNN is depicted in Fig. 3. BDD4BNN comprises four main components:
Region2BDD, BNN2CC, BDD Model Builder, and Query Engine. For a fixed
BNN $N = (t_1, \cdots, t_d, t_{d+1})$ and a region R of the input space of N, BDD4BNN constructs
the BDDs $(G_i^{out})_{i \in [s]}$ to encode the input-output relation of N in the region R,
where the BDD $G_i^{out}$ corresponds to the class $i \in [s]$. Technically, the region R is partitioned
into s parts represented by $(G_i^{out})_{i \in [s]}$. For each property query, BDD4BNN analyzes
$(G_i^{out})_{i \in [s]}$ and outputs the query result.

[Fig. 3: overview diagram of BDD4BNN; recoverable labels: Query Engine, BDD Model, Robustness, Interpretability, Result]

Fig. 4. Graphic representation of BDDs using Algorithm 1: (a) $\sum_{j=1}^{n} \ell_j \geq k$; (b) $x_1 + \neg x_2 + x_3 + \neg x_4 + x_5 + \neg x_6 \geq 3$


The general workflow of our approach is as follows. First, Region2BDD builds up a
BDD $G_R^{in}$ from the region R which represents the desired input space of N for analysis.
Second, BNN2CC transforms each block of the BNN N into a set of cardinality constraints
(CCs) similar to [6,46]. Third, BDD Model Builder builds the BDDs $(G_i^{out})_{i \in [s]}$
from all the cardinality constraints and the BDD $G_R^{in}$. Finally, Query Engine answers
queries by analyzing the BDDs $(G_i^{out})_{i \in [s]}$. Our Query Engine currently supports two
types of application queries: robustness analysis and interpretability.

In the rest of this section, we first introduce the key sub-component CC2BDD, 
which provides an encoding of cardinality constraints into BDDs. We then provide 
details of the components Region2BDD, BNN2CC, and BDD Model Builder. The 
Query Engine will be described in Sect. 4. 


3.2 CC2BDD: Cardinality Constraints to BDDs 


A cardinality constraint is a constraint of the form $\sum_{j=1}^{n} \ell_j \geq k$ over a vector x of
Boolean variables with length n, where the literal $\ell_j$ is either $x_j$ or $\neg x_j$ for each $j \in [n]$.
Note that constraints of the form $\sum_{j=1}^{n} \ell_j > k$, $\sum_{j=1}^{n} \ell_j \leq k$ and $\sum_{j=1}^{n} \ell_j < k$ are equivalent
to $\sum_{j=1}^{n} \ell_j \geq k + 1$, $\sum_{j=1}^{n} \neg\ell_j \geq n - k$ and $\sum_{j=1}^{n} \neg\ell_j \geq n - k + 1$, respectively. We assume
that 1 (resp. 0) is a special cardinality constraint that always holds (resp. never holds).
To encode $\sum_{j=1}^{n} \ell_j \geq k$ as a BDD, we observe that all the possible solutions of
$\sum_{j=1}^{n} \ell_j \geq k$ can be compactly represented by a BDD-like graph shown in Fig. 4(a),
where each node is labeled by a literal, and a solid (resp. dashed) edge from a node
labeled by $\ell_j$ means that the value of the literal $\ell_j$ is 1 (resp. 0). Thus, each path from the
$\ell_1$-node to the 1-leaf through the $\ell_j$-node (where $1 \leq j \leq n$) captures a set of valuations
where $\ell_j$ followed by a (horizontal) dashed line is set to be 0 while $\ell_j$ followed by
a (vertical) solid line is set to be 1, and all the other literals which are not along the
path can take arbitrary values. Clearly, for each of these valuations, there are at least k
positive literals, hence the constraint $\sum_{j=1}^{n} \ell_j \geq k$ holds.

Algorithm 1: BDD Construction for cardinality constraints
1 Proc CC2BDD(CC : \sum_{j=1}^{n} \ell_j \geq k)
2     G_{k+1,1} = G_{k+1,2} = ... = G_{k+1,n-k+1} = Const(1);
3     G_{1,n-k+2} = G_{2,n-k+2} = ... = G_{k,n-k+2} = Const(0);
4     for (i = k; i >= 1; i--) do
5         for (j = n-k+1; j >= 1; j--) do
6             if (\ell_{i+j-1} == x_{i+j-1}) then G_{i,j} = ITE(x_{i+j-1}, G_{i+1,j}, G_{i,j+1});
7             else G_{i,j} = ITE(x_{i+j-1}, G_{i,j+1}, G_{i+1,j});
8     return G_{1,1};


Based on the above observation, we build the BDD for $\sum_{j=1}^{n} \ell_j \geq k$ using
Algorithm 1. It builds a BDD for each node in Fig. 4(a), row-by-row (the index i in
Algorithm 1) and from right to left (the index j in Algorithm 1). For each node at the
i-th row and j-th column, the label of the node must be the literal $\ell_{i+j-1}$. We build the
BDD $G_{i,j} = ITE(x_{i+j-1}, G_{i+1,j}, G_{i,j+1})$ if $\ell_{i+j-1}$ is of the form $x_{i+j-1}$ (Line 6), otherwise
we build the BDD $G_{i,j} = ITE(x_{i+j-1}, G_{i,j+1}, G_{i+1,j})$ (Line 7). Finally, we obtain the BDD
$G_{1,1}$ that encodes the solutions of $\sum_{j=1}^{n} \ell_j \geq k$. Consider $x_1 + \neg x_2 + x_3 + \neg x_4 + x_5 + \neg x_6 \geq 3$;
Fig. 4(b) shows its BDD built by Algorithm 1.


Lemma 1. For each cardinality constraint $\sum_{j=1}^{n} \ell_j \geq k$, a BDD G with $O((n - k) \cdot k)$
nodes can be computed in $O((n - k) \cdot k)$ time such that $\mathcal{L}(G)$ is the set of all the solutions
of $\sum_{j=1}^{n} \ell_j \geq k$.


Compared with prior works [8,42] which transform general arithmetic constraints 
into BDDs, we devise a dedicated BDD encoding algorithm for the cardinality con- 
straints without applying reduction, hence it is more efficient. 
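
The grid construction of Algorithm 1 can be sketched as follows. This is an illustrative Python transcription under stated assumptions: nodes are plain tuples `(var, hi, lo)` with `True`/`False` leaves instead of CUDD nodes, and the function names `cc2bdd` and `evaluate` are hypothetical. The comments refer to the line numbers of Algorithm 1, and the example is the constraint of Fig. 4(b).

```python
from itertools import product

def cc2bdd(literals, k):
    """Decision graph for sum(literals) >= k, following Algorithm 1.

    literals[j-1] = (var, positive) encodes l_j: x_var if positive, else (not x_var).
    A node is a tuple (var, hi, lo); the leaves are True and False. No reduction is applied.
    """
    n = len(literals)
    if k <= 0:
        return True    # the constraint always holds
    if k > n:
        return False   # the constraint never holds
    G = {}
    for j in range(1, n - k + 2):            # Line 2: row k+1 is the 1-leaf
        G[(k + 1, j)] = True
    for i in range(1, k + 1):                # Line 3: column n-k+2 is the 0-leaf
        G[(i, n - k + 2)] = False
    for i in range(k, 0, -1):                # Line 4
        for j in range(n - k + 1, 0, -1):    # Line 5
            var, positive = literals[i + j - 2]        # literal l_{i+j-1} (0-based list)
            sat, unsat = G[(i + 1, j)], G[(i, j + 1)]
            if positive:                     # Line 6: literal is x_{i+j-1}
                G[(i, j)] = (var, sat, unsat)
            else:                            # Line 7: literal is (not x_{i+j-1})
                G[(i, j)] = (var, unsat, sat)
    return G[(1, 1)]

def evaluate(node, assignment):
    # Follow the hi edge when the tested variable is 1, the lo edge otherwise
    while isinstance(node, tuple):
        var, hi, lo = node
        node = hi if assignment[var] else lo
    return node

# x1 + (not x2) + x3 + (not x4) + x5 + (not x6) >= 3, cf. Fig. 4(b)
example = [(1, True), (2, False), (3, True), (4, False), (5, True), (6, False)]
root = cc2bdd(example, 3)

def brute_force(bits):
    return sum(1 for (var, pos) in example if bool(bits[var - 1]) == pos) >= 3

matches = all(evaluate(root, dict(enumerate(bits, start=1))) == brute_force(bits)
              for bits in product([0, 1], repeat=6))
```

The dictionary `G` holds one entry per grid position, so the number of created nodes is $k \cdot (n - k + 1)$, in line with the bound of Lemma 1.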


3.3 Region2BDD: Input Regions to BDDs 
In this paper, we consider the following two types of input regions.


— Input region based on Hamming distance. For an input $u \in \mathbb{B}_{\pm 1}^{n_1}$ and an integer
$r \geq 0$, R(u, r) denotes the set $\{x \in \mathbb{B}_{\pm 1}^{n_1} \mid HD(x, u) \leq r\}$, where HD(x, u) denotes the
Hamming distance between x and u. Intuitively, R(u, r) includes the input vectors
which differ from u by at most r positions.

— Input region with fixed indices. For an input $u \in \mathbb{B}_{\pm 1}^{n_1}$ and a set of indices $I \subseteq [n_1]$,
R(u, I) denotes the set $\{x \in \mathbb{B}_{\pm 1}^{n_1} \mid \forall i \in [n_1] \setminus I. \; u_i = x_i\}$. Intuitively, R(u, I) includes
the input vectors which differ from u only at the indices from I.


Note that both $R(u, n_1)$ and $R(u, [n_1])$ denote the entire input space $\mathbb{B}_{\pm 1}^{n_1}$.

Recall that each input sample is an element from $\mathbb{B}_{\pm 1}^{n_1}$. To represent the region R by
a BDD, we transform each value ±1 into a Boolean value 1/0. To this end, for each
input $u \in \mathbb{B}_{\pm 1}^{n_1}$, we create a new sample $u^{(b)} \in \mathbb{B}^{n_1}$ such that for every $i \in [n_1]$,
$u_i = 2u_i^{(b)} - 1$. Therefore, R(u, r) and R(u, I) will be represented by $R(u^{(b)}, r)$ and $R(u^{(b)}, I)$,
respectively. The transformation functions $t_i^{lin}$, $t_i^{bn}$, $t_i^{bin}$ and $t_{d+1}^{am}$ of the LIN, BN, BIN, and
ARGMAX layers (cf. Table 1) will be handled accordingly. Note that for convenience,
vectors over the Boolean domain $\mathbb{B}$ may be directly given by u or x when it is clear
from the context.


Region Encoding Under Hamming Distance. Given an input $u \in \mathbb{B}^{n_1}$ and an integer
r, the region R(u, r) can be expressed by a cardinality constraint $\sum_{j=1}^{n_1} \ell_j \leq r$ (which is
equivalent to $\sum_{j=1}^{n_1} \neg\ell_j \geq n_1 - r$), where for every $j \in [n_1]$, $\ell_j = x_j$ if $u_j = 0$, otherwise
$\ell_j = \neg x_j$. For instance, consider u = (1, 1, 1, 0, 0) and r = 2, we have:

$HD(u, x) = 1 \oplus x_1 + 1 \oplus x_2 + 1 \oplus x_3 + 0 \oplus x_4 + 0 \oplus x_5 = \neg x_1 + \neg x_2 + \neg x_3 + x_4 + x_5$.

Thus, R((1, 1, 1, 0, 0), 2) can be expressed by the cardinality constraint
$\neg x_1 + \neg x_2 + \neg x_3 + x_4 + x_5 \leq 2$, or equivalently $x_1 + x_2 + x_3 + \neg x_4 + \neg x_5 \geq 3$.

By Algorithm 1, the cardinality constraint of R(u, r) can be encoded by the BDD
$G_{u,r}^{in}$, such that $\mathcal{L}(G_{u,r}^{in}) = R(u, r)$. Following Lemma 1, we get that:

Lemma 2. For an input region R given by an input $u \in \mathbb{B}^{n_1}$ and an integer r, a BDD
$G_{u,r}^{in}$ with $O(r \cdot (n_1 - r))$ nodes can be computed in $O(r \cdot (n_1 - r))$ time such that
$\mathcal{L}(G_{u,r}^{in}) = R(u, r)$.


Region Encoding Under Fixed Indices. Given an input $u \in \mathbb{B}^{n_1}$ and a set of indices
$I \subseteq [n_1]$, the region $R(u, I) = \{x \in \mathbb{B}^{n_1} \mid \forall i \in [n_1] \setminus I. \; u_i = x_i\}$ can be represented by the
BDD $G_{u,I}^{in} = And_{i \in [n_1] \setminus I}((u_i == 1) \; ? \; Var(x_i) : Not(Var(x_i)))$. Intuitively, $G_{u,I}^{in}$ states that
the value at the index $i \in [n_1] \setminus I$ should be the same as the one in u while the value at
the index $i \in I$ is unrestricted. For instance, consider u = (1, 0, 0, 0) and I = {3, 4}, we
have:

$R((1, 0, 0, 0), \{3, 4\}) = \{(1, 0, 0, 0), (1, 0, 0, 1), (1, 0, 1, 0), (1, 0, 1, 1)\} = x_1 \wedge \neg x_2$.


Lemma 3. For an input region R given by an input $u \in \mathbb{B}^{n_1}$ and indices $I \subseteq [n_1]$, a BDD
$G_{u,I}^{in}$ with $O(n_1 - |I|)$ nodes can be computed in $O(n_1)$ time such that $\mathcal{L}(G_{u,I}^{in}) = R(u, I)$.
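
Both region encodings can be checked by brute force on the two examples above. The sketch below is illustrative only: it represents a cardinality constraint as a list of literals plus a threshold, and a fixed-index region as a cube (a list of fixed positions). The function names are hypothetical, and indices are 0-based in the code whereas the text uses 1-based indices.

```python
from itertools import product

def hamming_region_cc(u, r):
    """Return (literals, k) such that sum(literals) >= k  <=>  HD(x, u) <= r.

    A literal (j, True) stands for x_j and (j, False) for (not x_j); each literal is
    true exactly when x agrees with u at position j, so k = n_1 - r agreements are required.
    """
    literals = [(j, bool(bit)) for j, bit in enumerate(u)]
    return literals, len(u) - r

def cc_holds(literals, k, x):
    return sum(1 for j, positive in literals if bool(x[j]) == positive) >= k

def fixed_index_region(u, free):
    """Return the cube [(index, value), ...] fixing every position outside `free`."""
    return [(i, bit) for i, bit in enumerate(u) if i not in free]

def in_cube(cube, x):
    return all(x[i] == bit for i, bit in cube)

def hamming(x, u):
    return sum(a != b for a, b in zip(x, u))

# Example 1: u = (1, 1, 1, 0, 0), r = 2 gives x1 + x2 + x3 + (not x4) + (not x5) >= 3
u1 = (1, 1, 1, 0, 0)
lits, k = hamming_region_cc(u1, 2)
hamming_ok = all(cc_holds(lits, k, x) == (hamming(x, u1) <= 2)
                 for x in product([0, 1], repeat=5))

# Example 2: u = (1, 0, 0, 0), I = {3, 4} (1-based), i.e. free positions {2, 3} (0-based)
cube = fixed_index_region((1, 0, 0, 0), {2, 3})
members = [x for x in product([0, 1], repeat=4) if in_cube(cube, x)]
```

The cube for the second example fixes only the first two positions, which corresponds to the formula $x_1 \wedge \neg x_2$ given above.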


3.4 BNN2CC: BNNs to Cardinality Constraints 


As mentioned before, to encode the BNN $N = (t_1, \cdots, t_d, t_{d+1})$ as BDDs, we transform
the BNN N into cardinality constraints from which the desired BDDs $(G_i^{out})_{i \in [s]}$ are
constructed. To this end, we first transform each internal block $t_i : \mathbb{B}_{\pm 1}^{n_i} \to \mathbb{B}_{\pm 1}^{n_{i+1}}$ into $n_{i+1}$
cardinality constraints, each of which corresponds to one of the outputs of $t_i$. Then we
transform the output block $t_{d+1} : \mathbb{B}_{\pm 1}^{n_{d+1}} \to \mathbb{B}^{s}$ into $s(s - 1)$ cardinality constraints, where
one output class yields $(s - 1)$ cardinality constraints.

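For each vector-valued function t, we denote by $t_{\downarrow j}$ the (scalar-valued) function
returning the j-th entry of the output of t.

Transformation for Internal Blocks. Consider the internal block $t_i : \mathbb{B}_{\pm 1}^{n_i} \to \mathbb{B}_{\pm 1}^{n_{i+1}}$ for
$i \in [d]$. Recall that for every $j \in [n_{i+1}]$ and $x \in \mathbb{B}_{\pm 1}^{n_i}$, $t_{i \downarrow j}(x) = t_{i \downarrow j}^{bin}(t_{i \downarrow j}^{bn}(\langle x, W_{:,j} \rangle + b_j))$, and
each value ±1 of an input $u \in \mathbb{B}_{\pm 1}^{n_i}$ is replaced by 1/0 (cf. Sect. 3.3). To be consistent,
the function $t_{i \downarrow j} : \mathbb{B}_{\pm 1}^{n_i} \to \mathbb{B}_{\pm 1}$ is reformulated as the function $t_{i \downarrow j}^{(b)} : \mathbb{B}^{n_i} \to \mathbb{B}$ such that
for every $x \in \mathbb{B}^{n_i}$, $t_{i \downarrow j}^{(b)}(x) = 0.5 \times (t_{i \downarrow j}^{bin}(t_{i \downarrow j}^{bn}(\langle 2x - \mathbf{1}, W_{:,j} \rangle + b_j)) + 1)$, where $\mathbf{1}$ denotes the
vector of 1's with the width $n_i$.

Let $C_{i,j}$ be the following cardinality constraint:

$$C_{i,j} = \begin{cases}
\sum_{k=1}^{n_i} \ell_k \geq \left\lceil \frac{1}{2} \cdot \left(n_i + \mu_j - b_j - \frac{\gamma_j \cdot \sigma_j}{\alpha_j}\right) \right\rceil, & \text{if } \alpha_j > 0; \\
1, & \text{if } \alpha_j = 0 \wedge \gamma_j \geq 0; \\
0, & \text{if } \alpha_j = 0 \wedge \gamma_j < 0; \\
\sum_{k=1}^{n_i} \ell_k \leq \left\lfloor \frac{1}{2} \cdot \left(n_i + \mu_j - b_j - \frac{\gamma_j \cdot \sigma_j}{\alpha_j}\right) \right\rfloor, & \text{if } \alpha_j < 0;
\end{cases}$$

where for every $k \in [n_i]$, $\ell_k$ is $x_k$ if $W_{k,j} = +1$, and $\ell_k$ is $\neg x_k$ if $W_{k,j} = -1$.

Proposition 1. $t_{i \downarrow j}^{(b)} \Leftrightarrow C_{i,j}$.


For the proof, see [71].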
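
Proposition 1 can be checked numerically on small neurons. The following sketch is an assumption-laden illustration, not the authors' code: the function names are hypothetical, $\sigma_j > 0$ is assumed, and the toy parameters are chosen to be exactly representable in floating point so that the ceiling and floor are computed without rounding error (an exact implementation would rather use rational arithmetic).

```python
import math
from itertools import product

def neuron_cc(w_col, b, alpha, gamma, mu, sigma):
    """Cardinality constraint C_{i,j} for one neuron of an internal block.

    Returns ("const", truth_value) or (op, literals, bound) with op in {">=", "<="};
    a literal (k, True) stands for x_k and (k, False) for (not x_k). Assumes sigma > 0.
    """
    n = len(w_col)
    literals = [(k, w == 1) for k, w in enumerate(w_col)]
    if alpha == 0:
        return ("const", gamma >= 0)
    threshold = 0.5 * (n + mu - b - gamma * sigma / alpha)
    if alpha > 0:
        return (">=", literals, math.ceil(threshold))
    return ("<=", literals, math.floor(threshold))

def cc_holds(cc, x):
    if cc[0] == "const":
        return cc[1]
    op, literals, bound = cc
    count = sum(1 for k, positive in literals if bool(x[k]) == positive)
    return count >= bound if op == ">=" else count <= bound

def neuron_direct(w_col, b, alpha, gamma, mu, sigma, x):
    # LIN on the +1/-1 encoding 2x - 1, then BN, then BIN; True means output +1
    lin = sum((2 * xk - 1) * w for xk, w in zip(x, w_col)) + b
    return alpha * (lin - mu) / sigma + gamma >= 0

# Hypothetical parameter sets: (W column, b, alpha, gamma, mu, sigma)
cases = [
    ([1, 1, 1], 1.0, 1.0, 0.0, 0.0, 1.0),     # alpha > 0, integer threshold
    ([1, -1, 1], 0.5, 2.0, -1.0, 0.25, 1.0),  # alpha > 0, fractional threshold
    ([1, -1, 1], 0.0, -1.0, 0.5, 0.0, 2.0),   # alpha < 0
    ([-1, 1, 1], 3.0, 0.0, 0.0, 1.0, 1.0),    # alpha = 0, gamma >= 0
    ([-1, 1, 1], 3.0, 0.0, -0.5, 1.0, 1.0),   # alpha = 0, gamma < 0
]
proposition_holds = all(
    cc_holds(neuron_cc(*params), x) == neuron_direct(*params, x)
    for params in cases for x in product([0, 1], repeat=3)
)
```

For the first parameter set the threshold is $\frac{1}{2}(3 + 0 - 1 - 0) = 1$, so the neuron outputs +1 exactly when at least one of the three literals is true.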


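Transformation for the Output Block. For the output block $t_{d+1} : \mathbb{B}_{\pm 1}^{n_{d+1}} \to \mathbb{B}^{s}$, since
$t_{d+1} = t_{d+1}^{am} \circ t_{d+1}^{lin}$, for every $j \in [s]$ we can reformulate $t_{d+1 \downarrow j} : \mathbb{B}_{\pm 1}^{n_{d+1}} \to \mathbb{B}$ as the
function $t_{d+1 \downarrow j}^{(b)} : \mathbb{B}^{n_{d+1}} \to \mathbb{B}$ such that for every $x \in \mathbb{B}^{n_{d+1}}$, $t_{d+1 \downarrow j}^{(b)}(x) = t_{d+1 \downarrow j}(2x - \mathbf{1})$.
For every $j' \in [s] \setminus \{j\}$, we define the cardinality constraint $C_{d+1,j'}^{j}$ as follows:

$$C_{d+1,j'}^{j} = \begin{cases}
\sum_{k=1}^{n_{d+1}} \ell_{d+1,k} \geq \frac{1}{4}\left(b_{j'} - b_j + \sum_{k=1}^{n_{d+1}} (W_{k,j} - W_{k,j'})\right) + 1 + \#Neg, & \text{if } j' < j \text{ and } \frac{1}{4}\left(b_{j'} - b_j + \sum_{k=1}^{n_{d+1}} (W_{k,j} - W_{k,j'})\right) \text{ is an integer}; \\
\sum_{k=1}^{n_{d+1}} \ell_{d+1,k} \geq \left\lceil \frac{1}{4}\left(b_{j'} - b_j + \sum_{k=1}^{n_{d+1}} (W_{k,j} - W_{k,j'})\right) \right\rceil + \#Neg, & \text{otherwise};
\end{cases}$$

where $\#Neg = |\{k \in [n_{d+1}] \mid W_{k,j} - W_{k,j'} = -2\}|$, $\ell_{d+1,k}$ is $x_{d+1,k}$ if $W_{k,j} - W_{k,j'} = +2$,
$\ell_{d+1,k}$ is $\neg x_{d+1,k}$ if $W_{k,j} - W_{k,j'} = -2$, and $\ell_{d+1,k}$ is 0 if $W_{k,j} - W_{k,j'} = 0$.


Proposition 2. $t_{d+1 \downarrow j}^{(b)} \Leftrightarrow \bigwedge_{j' \in [s], j' \neq j} C_{d+1,j'}^{j}$.


For the proof, see [71].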
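
The case split in the output-block constraint comes from the tie-breaking rule of ARGMAX: class j must strictly beat every class with a smaller index, but only needs to match classes with a larger index. The sketch below checks Proposition 2 by brute force on a hypothetical 3-input, 3-class output block. The function names and parameters are illustrative, indices are 0-based, and the biases are chosen so that the floating-point arithmetic is exact.

```python
import math
from itertools import product

def output_cc(W, b, j, jp):
    """Constraint stating that class j wins against class jp (0-based indices).

    Returns (literals, bound), meaning sum(literals) >= bound; a literal (k, True)
    stands for x_k and (k, False) for (not x_k). Positions with equal weights are dropped,
    since their literal is the constant 0.
    """
    diffs = [row[j] - row[jp] for row in W]      # each entry is -2, 0, or +2
    literals = [(k, d > 0) for k, d in enumerate(diffs) if d != 0]
    neg = sum(1 for d in diffs if d == -2)       # the quantity #Neg
    t = (b[jp] - b[j] + sum(diffs)) / 4
    if jp < j and t == math.floor(t):
        return literals, int(t) + 1 + neg        # strict win needed: ties go to jp
    return literals, math.ceil(t) + neg

def cc_holds(literals, bound, x):
    return sum(1 for k, positive in literals if bool(x[k]) == positive) >= bound

def predicted_class(W, b, x):
    # LIN on the +1/-1 encoding 2x - 1, then first-index ARGMAX
    scores = [sum((2 * x[k] - 1) * W[k][c] for k in range(len(x))) + b[c]
              for c in range(len(b))]
    return scores.index(max(scores))

def proposition_2_holds(W, b):
    s = len(b)
    return all(
        all(cc_holds(*output_cc(W, b, j, jp), x) for jp in range(s) if jp != j)
        == (predicted_class(W, b, x) == j)
        for x in product([0, 1], repeat=len(W)) for j in range(s)
    )

W_toy = [[1, -1, 1], [1, 1, -1], [-1, 1, 1]]      # hypothetical weights over {+1, -1}
check_with_bias = proposition_2_holds(W_toy, [0.5, -1.0, 0.0])
check_with_ties = proposition_2_holds(W_toy, [0.0, 0.0, 0.0])  # exercises the tie case
```

With all biases equal to zero, score ties do occur, so the second check exercises the branch in which the bound is incremented by one.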

For each internal block $t_i : \mathbb{B}_{\pm 1}^{n_i} \to \mathbb{B}_{\pm 1}^{n_{i+1}}$, we denote by BNN2CC($t_i$) the cardinality
constraints $\{C_{i,1}, \cdots, C_{i,n_{i+1}}\}$. For each output class $j \in [s]$, we denote by
BNN2CC$^j$($t_{d+1}$) the cardinality constraints $\{C_{d+1,1}^{j}, \cdots, C_{d+1,j-1}^{j}, C_{d+1,j+1}^{j}, \cdots, C_{d+1,s}^{j}\}$. By
applying the above transformation to all the blocks of the BNN $N = (t_1, \cdots, t_d, t_{d+1})$,
we obtain its cardinality constraint form $N^{(b)} = (t_1^{(b)}, \cdots, t_d^{(b)}, t_{d+1}^{(b)})$ such that for each
$i \in [d]$, $t_i^{(b)}$ = BNN2CC($t_i$), and $t_{d+1}^{(b)}$ = (BNN2CC$^1$($t_{d+1}$), $\cdots$, BNN2CC$^s$($t_{d+1}$)). Given
an input $u \in \mathbb{B}^{n_1}$, we denote by $N^{(b)}(u)$ the index $j \in [s]$ such that all the cardinality
constraints in BNN2CC$^j$($t_{d+1}$) hold under the valuation u. It is straightforward to verify:


Theorem 1. $u \in \mathbb{B}_{\pm 1}^{n_1}$ is classified into the class j by the BNN N iff $N^{(b)}(u^{(b)}) = j$.

Example 1. Consider the BNN $N = (t_1, t_2)$ with one internal block $t_1$ and one output
block $t_2$ as shown in Fig. 5 (bottom left), where the elements of the weight matrix W
are associated to the edges, and the other parameters are given in the top-left table. The
transformation functions of blocks $t_1$ and $t_2$ are given in the top-right table, and their
cardinality constraints are given in the bottom-right table.

For instance, for each input $x \in \mathbb{B}_{\pm 1}^{3}$, $y_1 = sign(-x_1 + x_2 + x_3 + 2.7)$, i.e., $y_1 = +1 \Leftrightarrow
-x_1 + x_2 + x_3 + 2.7 \geq 0$. By replacing $x_i$ with $2 \times x_i^{(b)} - 1$ and $x_1^{(b)}$ with $1 - \neg x_1^{(b)}$, we have:
$y_1 = +1 \Leftrightarrow (-x_1^{(b)} + x_2^{(b)} + x_3^{(b)} + 0.85 \geq 0) \Leftrightarrow (\neg x_1^{(b)} + x_2^{(b)} + x_3^{(b)} \geq 0.15)$. Thus we get
$y_1^{(b)} \Leftrightarrow \neg x_1^{(b)} + x_2^{(b)} + x_3^{(b)} \geq 1$ (note that $y_1^{(b)} = 0 \Leftrightarrow \neg x_1^{(b)} + x_2^{(b)} + x_3^{(b)} < 1$). Similarly, we
can deduce that $o_1 \Leftrightarrow y_1 - y_2 \geq 0.7$, and thus $o_1 \Leftrightarrow y_1^{(b)} - y_2^{(b)} \geq 0.35 \Leftrightarrow y_1^{(b)} + \neg y_2^{(b)} \geq 2$.


3.5 BDD Model Builder 


The construction of the BDDs $(G_i^{out})_{i \in [s]}$ from the BNN $N^{(b)}$ and the input region R
is done iteratively throughout the blocks. Initially, the BDD for the first block is built,
which can be seen as the input-output relation for the first internal block. In the i-th
iteration, as the input-output relation of the first (i − 1) internal blocks has been encoded
into the BDD, we compose this BDD with the BDD for the block $t_i$ which is built from
its cardinality constraints $t_i^{(b)}$, resulting in the BDD for the first i internal blocks. Finally,
we obtain the BDDs $(G_i^{out})_{i \in [s]}$ of the BNN N, with respect to the input region R.


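[Fig. 5. An illustrating example. Recoverable panel contents: transformation functions
$y_1 = sign(-x_1 + x_2 + x_3 + 2.7)$, $y_2 = sign(-x_1 - x_2 + x_3 - 1)$, $o_1 \Leftrightarrow y_1 - y_2 \geq 0.7$,
$o_2 \Leftrightarrow y_1 - y_2 < 0.7$; cardinality constraint encodings $y_1^{(b)} \Leftrightarrow \neg x_1^{(b)} + x_2^{(b)} + x_3^{(b)} \geq 1$,
$o_1 \Leftrightarrow y_1^{(b)} + \neg y_2^{(b)} \geq 2$, $o_2 \Leftrightarrow y_1^{(b)} + \neg y_2^{(b)} < 2$.]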


Design Choice. Several design choices were made for efficiency, which
we discuss as follows. First of all, to encode the input-output relation of an internal
block $t_i$ into a BDD from its cardinality constraints $t_i^{(b)} = \{C_{i,1}, \cdots, C_{i,n_{i+1}}\}$, we need to
compute $And_{j \in [n_{i+1}]} CC2BDD(C_{i,j})$. A simple and straightforward approach is to initially
compute a BDD $G = CC2BDD(C_{i,1})$ and then iteratively compute the conjunction
$G = And(G, CC2BDD(C_{i,j}))$ of G and $CC2BDD(C_{i,j})$ for $2 \leq j \leq n_{i+1}$.

Alternatively, we use a divide-and-conquer strategy to recursively compute the
BDDs for the first half and the second half of the cardinality constraints respectively,
and then apply the And-operation. Our preliminary experimental results show that
the latter approach often performs better (about 2 times faster) than the former one,
although they generate the same BDD.
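
The two conjunction orders can be contrasted with a small sketch. It is illustrative only: sets of satisfying assignments stand in for BDDs, `frozenset.intersection` stands in for the And-operation, and the function names are hypothetical. The split point used here is simply the middle of the list, which may differ by one position from the split used in Algorithm 2.

```python
def conjoin_sequential(items, and_op):
    # Fold from left to right: ((C1 and C2) and C3) and ...
    acc = items[0]
    for item in items[1:]:
        acc = and_op(acc, item)
    return acc

def conjoin_balanced(items, and_op):
    # Divide and conquer: conjoin each half recursively, then combine the two results
    if len(items) == 1:
        return items[0]
    mid = len(items) // 2
    return and_op(conjoin_balanced(items[:mid], and_op),
                  conjoin_balanced(items[mid:], and_op))

# Stand-ins for the BDDs CC2BDD(C_{i,1}), ..., CC2BDD(C_{i,4}): sets of solutions
solution_sets = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 4}),
                 frozenset({3, 4, 5}), frozenset({0, 3, 4})]
sequential = conjoin_sequential(solution_sets, frozenset.intersection)
balanced = conjoin_balanced(solution_sets, frozenset.intersection)
```

Both orders yield the same result because conjunction is associative; with real BDDs the balanced order tends to keep intermediate operands of comparable size, which is one plausible reading of the speedup reported above.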

Second, constructing the BDD directly from the cardinality constraints $t_i^{(b)} =
\{C_{i,1}, \cdots, C_{i,n_{i+1}}\}$ becomes prohibitively costly when $n_i$ and $n_{i+1}$ are large, as the BDDs
$CC2BDD(C_{i,j})$ for $j \in [n_{i+1}]$ need to consider all the inputs in $\mathbb{B}^{n_i}$. To improve efficiency,
we apply feasible input propagation. Namely, when we construct the BDD for
the block $t_{i+1}$, we only consider its possible inputs with respect to the output of the block
$t_i$. Our preliminary experimental results show that the optimization could significantly
improve the efficiency of the BDD construction.

Third, instead of encoding the input-output relation of the BNN N as a sole BDD
or MTBDD, we opt to use a family of s BDDs $(G_i^{out})_{i \in [s]}$, each of which corresponds
to one output class of N. Recall that each output class $i \in [s]$ is represented by $(s - 1)$
cardinality constraints. Then, we can build a BDD $G_i$ for the output class i, similar to
the BDD construction for internal blocks. By composing $G_i$ with the BDD of the entire
internal blocks, we obtain the BDD $G_i^{out}$. Building a single BDD or MTBDD for the
BNN is possible from $(G_i^{out})_{i \in [s]}$, but our approach gives more flexibility, especially when
a specific target class is of interest, which is common for robustness analysis.


Algorithm 2: BDD Construction for BNNs
1 Proc BNN2BDD(BNN : N = (t_1, ..., t_d, t_{d+1}), Region : R(u, τ))
2     G^{in} = G^{in}_{u,τ} (cf. Section 3.3); N^{(b)} = (t_1^{(b)}, ..., t_d^{(b)}, t_{d+1}^{(b)}) (cf. Section 3.4);
3     for (i = 1; i <= d; i++) do
4         G' = Block2BDD(t_i^{(b)}, G^{in}, i);
5         G^{in} = Exists(G', x^i);    // x^i denotes the input variables of t_i^{(b)}
6         G = (i == 1) ? G' : RelProd(G, G');
7     for (i = 1; i <= s; i++) do
8         G_i = Block2BDD(t_{d+1↓i}^{(b)}, G^{in}, d + 1);
9         G_i^{out} = RelProd(G_i, G);
10    return (G_i^{out})_{i ∈ [s]}
11 Proc Block2BDD(CCs : {C_m, ..., C_n}, InputSpace : G^{in}, BlkIndex : i)
12    if n == m then
13        G_1 = CC2BDD(C_m) (cf. Algorithm 1);
14        G = And(G_1, G^{in});
15        if i ≠ d + 1 then G = Xnor(x_m^{i+1}, G);
16    else
17        G_1 = Block2BDD({C_m, ..., C_{⌊(n-m)/2⌋+m}}, G^{in}, i);
18        G_2 = Block2BDD({C_{⌊(n-m)/2⌋+m+1}, ..., C_n}, G^{in}, i);
19        G = And(G_1, G_2);
20    return G

Overall Algorithm. The overall BDD construction procedure is shown in Algorithm 2.
Given a BNN $N = (t_1, \cdots, t_d, t_{d+1})$ with s output classes and an input region R(u, τ), the
algorithm outputs the BDDs $(G_i^{out})_{i \in [s]}$, encoding the input-output relation of the BNN
N with respect to the input region R(u, τ).

The procedure BNN2BDD first builds the BDD representation $G_{u,\tau}^{in}$ of the input
region R(u, τ) and the cardinality constraints $N^{(b)}$ from the BNN N (Line 2). The first
for-loop builds a BDD encoding the input-output relation of the entire internal blocks
w.r.t. $G_{u,\tau}^{in}$. The second for-loop builds the BDDs $(G_i^{out})_{i \in [s]}$, each of which encodes the
input-output relation of the entire BNN for a class $i \in [s]$ w.r.t. $G_{u,\tau}^{in}$. The procedure
Block2BDD receives the cardinality constraints $\{C_m, \cdots, C_n\}$, a BDD $G^{in}$ representing
the feasible inputs of the block and the block index i as inputs, and returns a BDD G. If
$i = d + 1$, namely, the cardinality constraints $\{C_m, \cdots, C_n\}$ are from the output block, the
resulting BDD G encodes the subset of $\mathcal{L}(G^{in})$ that satisfies all the cardinality constraints
$\{C_m, \cdots, C_n\}$. If $i \neq d + 1$, then the BDD G encodes the input-output relation of the
Boolean function $f_{m,n}$ such that for every $x^i \in \mathcal{L}(G^{in})$, $f_{m,n}(x^i)$ is the truth vector of the
cardinality constraints $\{C_m, \cdots, C_n\}$ under the valuation $x^i$. When $m = 1$ and $n = n_{i+1}$,
$f_{m,n}$ is the same as $t_i^{(b)}$, hence $\mathcal{L}(G) = \{x^i \times x^{i+1} \in \mathcal{L}(G^{in}) \times \mathbb{B}^{n_{i+1}} \mid t_i^{(b)}(x^i) = x^{i+1}\}$. For a detailed
explanation, see [71].


Theorem 2. Given a BNN N with s output classes and an input region R(u, τ), we can
compute s BDDs $(G_i^{out})_{i \in [s]}$ such that the BNN N classifies an input $x \in R(u, \tau)$ into the
class $i \in [s]$ if $x^{(b)} \in \mathcal{L}(G_i^{out})$.


Algorithm 2 explicitly involves $O(d + s)$ RelProd-operations, $O(s^2 + \sum_{i \in [d]} n_i)$ And-operations
and $O(d)$ Exists-operations.


4 Applications: Robustness Analysis and Interpretability 


In this section, we present two applications within BDD4BNN, i.e., robustness analysis 
and interpretability of BNNs. 


4.1 Robustness Analysis 


Definition 2. Given a BNN N and an input region R(u, τ), the BNN is (locally) robust
w.r.t. the region R(u, τ) if each sample $x \in R(u, \tau)$ is classified into the same class as the
ground-truth class of u.

An adversarial example in the region R(u, τ) is a sample $x \in R(u, \tau)$ such that x is
classified into a class that differs from the ground-truth class of u.


As mentioned in Sect. 1, qualitative verification, which checks whether a BNN is
robust or not, is insufficient in many practical applications. In this paper, we are interested
in quantitative verification of robustness, which asks how many adversarial examples
there are in the input region of the BNN for each class. To answer this question,
given a BNN N and an input region R(u, τ), we first obtain the BDDs $(G_i^{out})_{i \in [s]}$ by
applying Algorithm 2 and then count the number of adversarial examples for each class
in the input region R(u, τ). Note that counting adversarial examples amounts to computing
$|R(u, \tau)| - |\mathcal{L}(G_g^{out})|$, where g denotes the ground-truth class of u, and $|\mathcal{L}(G_g^{out})|$
can be computed in time $O(|G_g^{out}|)$.
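
The linear-time count mentioned above follows from a standard memoized traversal: each node is visited once, and skipped variables contribute a factor of two each. The sketch below is illustrative; it assumes an ordered BDD over variables 0, ..., n−1 stored as nested tuples `(var, hi, lo)` with `True`/`False` leaves, and the toy BDD standing in for $G_g^{out}$ is hypothetical.

```python
def count_models(node, n, level=0, memo=None):
    """Number of assignments to variables level..n-1 that satisfy `node`.

    `node` is True, False, or a tuple (var, hi, lo) with var >= level, where the
    variables are tested in increasing order along every path.
    """
    if memo is None:
        memo = {}
    if node is True:
        return 2 ** (n - level)
    if node is False:
        return 0
    var, hi, lo = node
    key = id(node)
    if key not in memo:
        # Count over variables var..n-1; each node is expanded only once
        memo[key] = (count_models(hi, n, var + 1, memo)
                     + count_models(lo, n, var + 1, memo))
    # Variables level..var-1 are not tested on this path and are therefore free
    return memo[key] * 2 ** (var - level)

def count_adversarial(region_size, g_out, n):
    # |R(u, tau)| - |L(G_g^out)|, with g the ground-truth class
    return region_size - count_models(g_out, n)

# Hypothetical G_g^out over 3 input bits: x0 and (x1 or x2); the region is the full space
toy_g_out = (0, (1, True, (2, True, False)), False)
num_adversarial = count_adversarial(2 ** 3, toy_g_out, 3)
```

Here three of the eight inputs are classified into the ground-truth class, so five inputs are counted as adversarial examples.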

In some applications, more refined analysis is needed. For instance, it may be 
acceptable to misclassify a dog as a cat, but unacceptable to misclassify a tree as a car. 
This suggests that the robustness of BNNs may depend on the classes to which samples 
are misclassified. To capture this, we consider the notion of targeted robustness. 


Definition 3. Given a BNN N, an input region R(u, τ), and the class t, the BNN is
t-target-robust w.r.t. the region R(u, τ) if every sample $x \in R(u, \tau)$ is never classified into
the class t. (Note that we assume that the ground-truth class of u differs from the class t.)


The quantitative verification problem of t-target-robustness of a BNN asks how
many adversarial examples in the input region R(u, τ) are misclassified to the class t by
the BNN N. To answer this question, we first obtain the BDD $G_t^{out}$ by applying Algorithm 2
and then count the number of adversarial examples by computing $|\mathcal{L}(G_t^{out})|$.

Note that, if one wants to compute the (locally) maximal safe Hamming distance 
that satisfies a robustness property for an input sample (e.g., the proportion of adversar- 
ial examples is below a threshold), our framework can incrementally compute such a 
distance without constructing the BDD models of the entire BNN from scratch. 


Definition 4. Given a BNN N, input region R(u, r) and threshold $\epsilon \geq 0$, $r_1$ is the
(locally) maximal safe Hamming distance of R(u, r), if one of the following holds:


— if $Pr(R(u, r)) > \epsilon$, then $Pr(R(u, r_1)) \leq \epsilon$ and $Pr(R(u, r')) > \epsilon$ for $r' : r_1 < r' < r$;

— if $Pr(R(u, r)) \leq \epsilon$, then $Pr(R(u, r_1 + 1)) > \epsilon$ and $Pr(R(u, r')) \leq \epsilon$ for $r' : r < r' \leq r_1$;

where $Pr(R(u, r))$ is the probability $\frac{\sum_{i \in [s], i \neq g} |\mathcal{L}(G_i^{out})|}{|R(u, r)|}$ for g being the ground-truth class of
u, assuming a uniform distribution of adversarial samples.
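
On small inputs, Definition 4 can be explored by brute force. The sketch below is not the incremental BDD-based procedure: it recomputes the adversarial proportion from scratch at each radius by enumerating the Hamming ball, and the classifier `majority`, the function names, and the threshold are hypothetical choices made for illustration.

```python
from itertools import combinations

def adv_ratio(classify, u, r, g):
    """Proportion of inputs within Hamming distance r of u not classified into g."""
    total = adversarial = 0
    for dist in range(r + 1):
        for flips in combinations(range(len(u)), dist):
            x = list(u)
            for i in flips:
                x[i] ^= 1            # flip the chosen bits of u
            total += 1
            adversarial += classify(tuple(x)) != g
    return adversarial / total

def max_safe_hd(classify, u, r, eps, g):
    """Search downwards if the proportion at r exceeds eps, upwards otherwise."""
    n = len(u)
    if adv_ratio(classify, u, r, g) > eps:
        while r > 0:                 # decrease r
            r -= 1
            if adv_ratio(classify, u, r, g) <= eps:
                return r
    else:
        while r < n:                 # increase r
            r += 1
            if adv_ratio(classify, u, r, g) > eps:
                return r - 1
    return r

def majority(x):
    # Hypothetical 5-bit classifier: class 1 iff at least three bits are set
    return 1 if sum(x) >= 3 else 0

u = (1, 1, 1, 1, 1)
from_below = max_safe_hd(majority, u, 1, 0.1, 1)
from_above = max_safe_hd(majority, u, 4, 0.1, 1)
```

For this toy classifier no adversarial example exists up to distance 2, while at distance 3 the proportion jumps to 10/26, so both search directions stop at the same distance.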


Algorithm 3 shows the procedure to incrementally compute the maximal safe Hamming
distance for a given threshold $\epsilon \geq 0$, input region R(u, r), and ground-truth class
g of u. Remark that Pr(R(u, r)) may not be monotonic w.r.t. the Hamming distance r.

Algorithm 3: Compute the maximal safe Hamming distance
1 Proc MaxHD(BNN : N = (t_1, ..., t_d, t_{d+1}), Region : R(u, r), Threshold : ε, Class : g)
2     (G_i^{out})_{i ∈ [s]} = BNN2BDD(N, R(u, r));
3     if (\sum_{i ∈ [s], i ≠ g} |L(G_i^{out})| / |R(u, r)| > ε) then    // decrease r
4         while (r > 0) do
5             r = r - 1;
6             (G_i^{out})_{i ∈ [s]} = (And(G^{in}_{u,r}, G_i^{out}))_{i ∈ [s]};
7             if (\sum_{i ∈ [s], i ≠ g} |L(G_i^{out})| / |R(u, r)| <= ε) then return r;
8     else    // increase r
9         while (r < n_1) do    // n_1 is the input size of the BNN N
10            r = r + 1;
11            (B_i^{out})_{i ∈ [s]} = BNN2BDD(N, R(u, r) \ R(u, r - 1));
12            (G_i^{out})_{i ∈ [s]} = (Or(B_i^{out}, G_i^{out}))_{i ∈ [s]};
13            if (\sum_{i ∈ [s], i ≠ g} |L(G_i^{out})| / |R(u, r)| > ε) then return r - 1;
14    return r


4.2 Interpretability


In general, interpretability addresses the question of why some inputs in the input region
are (mis)classified by the BNN into a specific class. We consider the interpretability of
BNNs using two complementary explanations, i.e., prime implicant explanations and
essential features.


Definition 5. Given a BNN N, an input region R(u, τ) and a class g, a prime implicant
explanation (PI-explanation) of decisions made by the BNN N on the inputs $\mathcal{L}(G_g^{out})$
is a minimal set of literals $\{\ell_1, \cdots, \ell_k\}$ such that for every $x \in R(u, \tau)$, if x satisfies
$\ell_1 \wedge \cdots \wedge \ell_k$ then x is classified into the class g by the BNN N.

Intuitively, a PI-explanation $\{\ell_1, \cdots, \ell_k\}$ indicates that $\{var(\ell_1), \cdots, var(\ell_k)\}$ are key
features, namely, if fixed, the prediction is guaranteed no matter how the remaining
features change. Remark that there may be more than one PI-explanation for a set of
inputs $\mathcal{L}(G_g^{out})$. When g is set to be the class of the benign input u, a PI-explanation on
$G_g^{out}$ suggests why these samples are classified into g by the BNN N.


Definition 6. Given a BNN N, an input region R(u, τ) and a class g, the essential features
for the inputs $\mathcal{L}(G_g^{out})$ are literals $\{\ell_1, \cdots, \ell_k\}$ such that for every $x \in R(u, \tau)$, if x is
classified into the class g by the BNN N, then x satisfies $\ell_1 \wedge \cdots \wedge \ell_k$.


Intuitively, the essential features $\{\ell_1, \cdots, \ell_k\}$ denote the key features such that all
samples $x \in R(u, \tau)$ that are classified into the class g by the BNN N must agree on
these features. Essential features differ from PI-explanations, where the former can be
seen as a necessary condition, while the latter can be seen as a sufficient condition.

BDD libraries (e.g., CUDD [58]) usually provide APIs to identify prime implicants
(e.g., Cudd_bddPrintCover and Cudd_FirstPrime) and essential variables (e.g.,
Cudd_FindEssential). Therefore, prime implicants and essential features can be computed
via queries on the BDDs $(G_i^{out})_{i \in [s]}$.
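
The difference between the two kinds of explanation can be illustrated without a BDD library by working directly on an explicit set of inputs. The sketch below is a brute-force illustration under stated assumptions: `models` plays the role of $\mathcal{L}(G_g^{out})$, literals are `(index, value)` pairs, and none of the function names correspond to CUDD APIs.

```python
from itertools import product

def essential_literals(models):
    """Literals (index, value) on which every input in `models` agrees (necessary condition)."""
    models = list(models)
    if not models:
        return []
    first = models[0]
    return [(i, first[i]) for i in range(len(first))
            if all(m[i] == first[i] for m in models)]

def is_implicant(cube, region, models):
    """True if every region input satisfying all literals of `cube` lies in `models`."""
    models = set(models)
    return all(x in models for x in region
               if all(x[i] == value for i, value in cube))

def is_pi_explanation(cube, region, models):
    # Minimality: dropping any single literal must break the implicant property.
    # Checking single removals suffices because every superset of an implicant is an implicant.
    return is_implicant(cube, region, models) and not any(
        is_implicant([lit for lit in cube if lit != dropped], region, models)
        for dropped in cube)

# Hypothetical setting: 3 input bits, class g is predicted iff x0 and (x1 or x2)
region = list(product([0, 1], repeat=3))
models = [x for x in region if x[0] and (x[1] or x[2])]

essential = essential_literals(models)
pi_small = is_pi_explanation([(0, 1), (1, 1)], region, models)
pi_large = is_pi_explanation([(0, 1), (1, 1), (2, 1)], region, models)
```

In this toy setting $x_0$ is the only essential feature, while $\{x_0, x_1\}$ is one of two PI-explanations; adding $x_2$ keeps the set sufficient but no longer minimal.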


5 Evaluation 


We have implemented our framework as a prototype tool BDD4BNN based on the
CUDD package [58]. BDD4BNN is implemented with Python as the front-end to preprocess
BNNs and C++ as the back-end to perform the BDD encoding and analysis.
In this section, we report the experimental results, including BDD encoding, robustness
analysis based on Hamming distance, and interpretability.

Experimental Setup. The experiments were conducted on a machine with an Intel Xeon
Gold 5118 2.3 GHz CPU, a 64-bit Ubuntu 20.04 LTS operating system, and 128 GB of RAM.
Each BDD encoding was executed on a single core with a time limit of 8 hours.


Benchmarks. We use the PyTorch (v1.0.1.post2) deep learning platform provided by 
NPAQ [6] to train and test BNNs. We trained 12 BNN models (P1-P12) with varying 
sizes using the MNIST dataset [35]. The MNIST dataset contains 70,000 gray-scale 28
x 28 images (60,000 for training and 10,000 for testing) of handwritten digits with 10
classes. In our experiments, we downscale the images (28 x 28) to some selected input
size $n_1$ (i.e., the corresponding image is of the size $\sqrt{n_1} \times \sqrt{n_1}$) and then binarize the
normalized pixels of the images.

Details of the BNN models are listed in Table 3, each of which has 10 classes (i.e.,
s = 10). Column 1 shows the name of the BNN model. Column 2 shows the architecture
of the BNN model, where $n_1 : \cdots : n_{d+1} : s$ denotes that the BNN model has $d + 1$
blocks, $n_1$ inputs and s outputs; the i-th block for $i \in [d + 1]$ has $n_i$ inputs and $n_{i+1}$
outputs with $n_{d+2} = s$. Recall that each internal block has 3 layers while the output
block has 2 layers. Therefore, the number of layers ranges from 5 to 14, the dimension 
of inputs ranges from 9 to 784, and the number of hidden neurons per linear layer ranges 
from 10 to 100. Column 3 shows the accuracy of the BNN model on the test set of the 
MNIST dataset. (We can observe that the accuracy increases with the size of inputs, the 
number of layers, and the number of hidden neurons per layer.) We randomly choose 
10 images from the training set of the MNIST dataset (one image per class) to evaluate 
our approach. 


5.1 Performance of BDD Encoding 
We evaluate BDD4BNN on the BNNs listed in Table 3 using different input regions. 


BDD Encoding Using Full Input Space. We evaluate BDD4BNN on the BNNs (P1-P5),
where $\mathbb{B}_{\pm 1}^{n_1}$ is used as the input region. The results are shown in Table 4, where $|G|$
denotes the number of BDD nodes in the BDD manager. We can observe that both the
execution time and the number of BDD nodes increase with the size of BNNs.


BDD Encoding Under Hamming Distance. We evaluate BDD4BNN on the BNNs
(P5-P12). In this case, an input region is given by one of the 10 images and a Hamming
distance r ranging from 2 to 6. The average results are shown in Table 5, where [i] (resp.
(i)) indicates the number of cases in which BDD4BNN runs out of memory (resp. time).
Overall, the execution time and the number of BDD nodes increase with r. BDD4BNN
succeeded on all the cases when $r \leq 4$, 75 cases out of 80 when r = 5, and 48 cases
out of 80 when r = 6. We observe that the execution time and number of BDD nodes
increase with the number of hidden neurons (P6 vs. P7, P8 vs. P9, and P11 vs. P12),
while the effect of the number of layers is diverse (P6 vs. P8 vs. P10, and P7 vs. P9).
From P9 and P10, we observe that the number of hidden neurons per layer is likely
the key impact factor of the efficiency of BDD4BNN. Interestingly, our tool BDD4BNN
works well on BNNs with large input sizes (i.e., on P11 and P12).

Table 3. BNN benchmarks

Name | Architecture | Accuracy | Name | Architecture | Accuracy
P1 | 9:20:10 | 12.23% | P7 | 100:100:10 | 75.16%
P2 | 16:32:10 | 28.63% | P8 | 100:50:20:10 | 71.1%
P3 | 16:64:32:10 | 25.14% | P9 | 100:100:50:10 | 77.37%
P4 | 36:15:10:10 | 27.12% | P10 | 100:50:30:30:10 | 80.63%
P5 | 64:10:10 | 49.16% | P11 | 784:30:50:50:50:10 | 88.23%
P6 | 100:50:10 | 73.25% | P12 | 784:50:50:50:50:10 | 86.95%


Table 4. BDD encoding using full input space

Name | P1 | P2 | P3 | P4 | P5
Time (s) | 0 | 0.78 | 28.21 | 10924.51 | Timeout
$|G|$ | 288 | 18,864 | 17,636 | 152,830,875 | -


These results demonstrate the efficiency and scalability of BDD4BNN on BDD
encoding of BNNs. We remark that, compared with the learning-based approach [54],
our approach is considerably more efficient and scalable. For instance, the learning-based
approach takes 403 s to encode a BNN with an input size of 64, 5 hidden neurons, and
an output size of 2 when r = 6, while ours takes about 3 s even for a larger network P5.


5.2 Robustness Analysis 


We evaluate BDD4BNN on the robustness of BNNs, including robustness analysis under 
different input regions and maximal safe Hamming distance computing. 


Robustness Verification with Hamming Distance. We evaluate BDD4BNN on BNNs
(P7, P8, P9, and P11) using the 10 images. The input regions are given by the Hamming
distance r ranging from 2 to 4, resulting in 120 instances. To the best of our knowledge,
NPAQ [6] is the only work that supports quantitative robustness verification of BNNs, to
which we compare BDD4BNN. Recall that NPAQ only provides PAC-style guarantees.
Namely, it sets a tolerable error $\epsilon$ and a confidence parameter $\delta$. The final estimated
results of NPAQ have the bounded error $\epsilon$ with confidence of at least $1 - \delta$, i.e.,

$$Pr[(1 + \epsilon)^{-1} \times RealNum \leq EstimatedNum \leq (1 + \epsilon) \times RealNum] \geq 1 - \delta \quad (1)$$

In our experiments, we set $\epsilon = 0.8$ and $\delta = 0.2$, as done in [6].

Table 5. BDD encoding under Hamming distance

        r = 2               r = 3               r = 4                  r = 5                        r = 6
Name    Time(s)  |G|        Time(s)  |G|        Time(s)  |G|           Time(s)        |G|           Time(s)      |G|
P5      0.01     1,559      0.03     9,795      0.11     36,796        0.74           176,107       2.94         592,104
P6      0.25     4,670      4.17     84,037     109.26   1,018,571     2,292.5        11,375,842    (5) 17,811   41,883,970
P7      0.65     5,295      22.70    106,754    652.78   1,575,722     (1) 17,399     16,163,078    [10]         -
P8      0.14     6,147      1.95     125,226    44.51    1,668,027     1,146.8        20,519,582    (1) 12,491   172,369,297
P9      1.99     6,139      63.30    136,126    1,428.6  2,005,666     [1](3) 17,039  29,323,244    [10]         -
P10     0.30     4,630      4.87     100,054    101.41   1,603,920     1,909.9        19,844,299    (5) 20,484   173,316,483
P11     5.52     3,128      513022120           6.60     86,413        11.63          556,774       238.2        2,881,468
P12     12.4     5,693      12.87    49,996     16.92    493,820       403.09         5,739,602     (1) 11,058   16,241,733


Table 6. Robustness verification under Hamming distance


         NPAQ [6]                         BDD4BNN                      Diff
BNN  r   #(Adv)     Time(s)      Pr(adv)  #(Adv)     Time(s)  Pr(adv)  #(Adv)   Speed Up
P7   2   875        271.07       17.32%   1,806      0.65     35.76%   106.4%   416
P7   3   39,587     919.88       23.74%   65,054     22.71    39.01%   64.33%   40
P7   4   1,023,798  3,862.0      25.04%   1,501,691  661.79   36.73%   46.68%   5
P8   2   1,601      187.78       31.70%   2,261      0.14     44.76%   41.22%   1,340
P8   3   66,562     396.45       39.92%   64,372     1.96     38.60%   -3.29%   201
P8   4   1,636,070  1,861.7      40.02%   1,829,103  45.0     44.74%   11.80%   40
P9   2   1,214      363.44       24.03%   1,406      1.99     27.84%   15.82%   182
P9   3   51,464     3,763.6      30.86%   42,901     63.31    25.73%   -16.64%  58
P9   4   1,316,181  (1) 9,007.8  32.20%   3,968,609  1,505.0  97.08%   201.5%   5
P11  2   12,083     3,831.0      3.93%    28,736     5.52     9.34%    137.8%   693
P11  3   0          (2) 4,634.2  0%       0          5.68     0%       -        815
P11  4   0          (2) 7,979.1  0%       0          6.38     0%       -        1,250


Results averaged over the images are shown in Table 6. NPAQ ran out of time
on 5 instances (which occur in P9 with r = 4 and P11 with r = 3 and r = 4), while
BDD4BNN successfully verified all 120 instances. Table 6 only shows the results of
the 115 instances that can be solved by NPAQ. Columns 3, 4, and 5 (resp. 6, 7, and 8) show
the number of adversarial examples, the execution time, and the proportion of adversarial
examples in the input region for NPAQ (resp. BDD4BNN). Column 9 shows the error rate
$(\mathit{RealNum} - \mathit{EstimatedNum}) / \mathit{EstimatedNum}$,
where RealNum is from our result and EstimatedNum is from NPAQ. Column 10
shows the speedup of BDD4BNN compared with NPAQ. Note that the numbers of
adversarial examples are 0 for P11 on the input regions with r = 3 and r = 4 that can be
solved by NPAQ. There do exist input regions for P11 that cannot be solved by NPAQ
but have adversarial examples (see below). On the instances that were solved by both NPAQ
and BDD4BNN, BDD4BNN is significantly faster (5x to 1,340x) and more accurate
than NPAQ. From Table 5 and Table 6, we also found that most of the verification time
is spent on BDD encoding, while the rest usually takes less than 10 s.
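
The derived columns of Table 6 can be recomputed from the raw counts. The following is a minimal Python sketch, not part of BDD4BNN; the function names are ours, and the input size of 100 bits for P8 is an assumption that is not stated in this section (it is merely consistent with the reported proportion).

```python
from math import comb


def region_size(num_inputs: int, r: int) -> int:
    """Count the binary vectors within Hamming distance r of a fixed input."""
    return sum(comb(num_inputs, k) for k in range(r + 1))


def error_rate(real_num: int, estimated_num: int) -> float:
    """Column 9 of Table 6: (RealNum - EstimatedNum) / EstimatedNum."""
    return (real_num - estimated_num) / estimated_num


def speedup(npaq_seconds: float, bdd4bnn_seconds: float) -> float:
    """Column 10 of Table 6: how many times faster BDD4BNN is than NPAQ."""
    return npaq_seconds / bdd4bnn_seconds


# Row "P8, r = 2" of Table 6.
real_num, estimated_num = 2261, 1601
print(f"error rate: {error_rate(real_num, estimated_num):.2%}")  # 41.22%
# ASSUMPTION: P8 takes 100 binary inputs (not stated in this section).
print(f"Pr(adv):    {real_num / region_size(100, 2):.2%}")       # 44.76%
# The table reports 1,340; the rounded times printed in the table give a
# slightly different quotient.
print(f"speedup:    {speedup(187.78, 0.14):.0f}x")
```

The region size is the sum of binomial coefficients C(n, k) for k = 0, ..., r, which is why the proportion of adversarial examples shrinks quickly as the input size grows.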


Details of Robustness and Targeted Robustness. Figure 6(a) (resp. Fig. 6(b) and
Fig. 6(c)) depicts the distributions of classes on P8 with Hamming distance r = 2 (resp.
P8 with r = 3 and P11 with r = 2), where on the x-axis i = 0, ..., 9 denotes the input
region that is within the respective Hamming distance to the image of digit i (called
i-region). We can observe that P8 is robust for the 0-region when r = 2 and robust for
the 6-region when r = 2 and r = 3, but is not robust for the other regions. (Note that P8
is not robust for the 0-region when r = 3, which is hard to see in Fig. 6(b) due
to the small number of adversarial examples.) Most of the adversarial examples in the
1-region and 5-region are misclassified into the digit 3 by P8. P11 is not robust for the
1-region or the 5-region, but is robust for all the other regions. Though P8 and P11 are
not robust on some input regions, they are t-target-robust for many target classes
t, e.g., P11 is t-target-robust for the 1-region when t ≠ 2, and for the 5-region when t ≠ 3.
(The raw data are given in [71].)

Fig. 6. Details of robustness verification with Hamming distance


Quality Validation of NPAQ. Figure 6(d) shows the distribution of error rates of NPAQ,
where the x-axis is the range of the error rate and the y-axis is the corresponding number
of instances. There are 19 instances where the estimated number of adversarial examples
exceeds (1 + ε) times the real number of adversarial examples and 7 instances where
the estimated number of adversarial examples is less than (1 + ε)^{-1} times the real number
of adversarial examples. This means that out of 115 instances, the estimated number is
within the allowed range in only 89 instances (about 77%), which is below the promised
confidence 1 − δ = 0.8.
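
The validation above amounts to checking the bound of Eq. (1) per instance and comparing the fraction of in-range instances against 1 − δ. A minimal Python sketch follows; the function name and the three small example counts are hypothetical, while the instance counts (115, 19, 7) are taken from the text.

```python
EPSILON = 0.8  # tolerable error used in the experiments
DELTA = 0.2    # confidence parameter used in the experiments


def within_pac_bounds(real_num: float, estimated_num: float,
                      eps: float = EPSILON) -> bool:
    """Check (1 + eps)^-1 * RealNum <= EstimatedNum <= (1 + eps) * RealNum."""
    return real_num / (1 + eps) <= estimated_num <= (1 + eps) * real_num


# Hypothetical counts, chosen only to exercise both sides of the bound.
print(within_pac_bounds(real_num=100, estimated_num=150))  # True
print(within_pac_bounds(real_num=100, estimated_num=181))  # False (overestimate)
print(within_pac_bounds(real_num=100, estimated_num=55))   # False (underestimate)

# Counts reported in the text: 115 instances, 19 too high, 7 too low.
total, too_high, too_low = 115, 19, 7
within = total - too_high - too_low
observed = within / total
print(within, f"{observed:.1%}", observed >= 1 - DELTA)    # 89 77.4% False
```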


Maximal Safe Hamming Distance. As a representative of such an analysis, we evaluate
BDD4BNN on 4 BNNs (P7, P8, P9, and P11) with 10 images for 2 robustness
thresholds (ε = 0 and ε = 0.03). The initial Hamming distance r is 3. Intuitively, ε = 0
(resp. ε = 0.03) means that up to 0% (resp. 3%) of the samples in the input region can be
adversarial.

Table 7 shows the results, where columns SD and Time give the maximal safe Hamming
distance and the execution time, respectively. BDD4BNN solved 74 out of 80
instances. (For the remaining 6 instances, BDD4BNN ran out of time or memory, but
it was still able to compute a larger safe Hamming distance.) We can observe that the
maximal safe Hamming distance increases with the threshold ε on several BNNs and
input regions. We can also observe that P11 is more robust than the others, which is
consistent with their accuracies (cf. Table 3). Note that SD = −1 indicates that the input
image itself is misclassified.


Table 7. Maximal safe Hamming distance


       P7                        P8                        P9                        P11
       ε = 0        ε = 0.03     ε = 0        ε = 0.03     ε = 0        ε = 0.03     ε = 0        ε = 0.03
Image  SD  Time(s)  SD  Time(s)  SD  Time(s)  SD  Time(s)  SD  Time(s)  SD  Time(s)  SD  Time(s)  SD  Time(s)
0      1   15.09    4   10,845   2   0.51     6   Timeout  3   746.15   3   737.96   6   29.69    6   29.28
1      -1  19.96    -1  9.13     -1  2.84     -1  2.97     0   155.50   0   155.09   0   6.49     0   6.11
2      2   13.25    3   422.04   0   0.46     0   0.50     1   37.50    4   14,127   6   11,334   6   11,437
3      0   21.39    0   20.94    -1  1.92     -1  2.08     0   41.04    0   40.49    6   8,323.1  6   8,088.3
4      3   426.81   5   OOM      -1  2.41     -1  2.61     2   8.08     5   OOM      6   30.85    6   30.74
5      -1  15.60    -1  5.92     -1  0.68     -1  0.74     -1  22.54    -1  21.54    -1  7.03     -1  6.72
6      4   7,990.6  5   OOM      3   5.69     4   198.26   1   57.37    4   Timeout  6   44.57    6   45.12
7      -1  16.08    -1  5.90     -1  2.49     -1  2.52     1   89.49    4   Timeout  6   89.38    6   88.39
8      -1  19.02    -1  9.28     -1  TI       -1  1.80     -1  80.16    -1  79.91    6   43.95    6   43.30
9      0   26.82    0   27.69    0   5.09     1   5.39     -1  109.04   -1  107.24   6   338.73   6   327.48


Fig. 7. Graphic representation of essential features and PI-explanations: (a) EFs for class 2, (b) EFs for class 5, (c) PI for class 2, (d) PI for class 5


5.3 Interpretability 


To demonstrate the ability of BDD4BNN on interpretability, we consider the analysis of 
the BNN P12 and the image u of digit 1. 


Essential Features. For the input region given by the Hamming distance r = 4, we
compute two sets of essential features for the inputs $\mathcal{L}(G_2^{u,4})$ and $\mathcal{L}(G_5^{u,4})$, i.e., the
adversarial examples in the region R(u, 4) that are misclassified into the classes 2 and
5, respectively. The essential features are depicted in Figs. 7(a) and 7(b), where black
(resp. blue) means that the value of the corresponding pixel is 1 (resp. 0), and
yellow means that the corresponding pixel can take an arbitrary value.
Figure 7(a) (resp. Fig. 7(b)) indicates that the inputs $\mathcal{L}(G_2^{u,4})$ (resp. $\mathcal{L}(G_5^{u,4})$) must agree
on these black- and blue-colored pixels.
PI-Explanations. For demonstration, we assume that the input region is given by the
fixed set of indices J = {1, 2, ..., 28}, which denotes the first row of pixels of 28 × 28
images. We compute two PI-explanations of the inputs $\mathcal{L}(G_2^{u,J})$ and $\mathcal{L}(G_5^{u,J})$. The
PI-explanations are depicted in Figs. 7(c) and 7(d). Figure 7(c) (resp. Fig. 7(d)) suggests
that, by the definition of the PI-explanation, all the images in the region R(u, J) obtained
by assigning arbitrary values to the yellow-colored pixels are always misclassified into
the class 2 (resp. class 5), while changing one black-colored or blue-colored pixel would
change the prediction result since a PI-explanation is a minimal set of literals.


6 Related Work 


In this section, we discuss the related work on qualitative/quantitative analysis and
interpretability of DNNs. As there is a vast amount of literature regarding these topics, we
only discuss the work most closely related to BDD4BNN.


Qualitative Analysis of DNNs. For real-numbered DNNs, various formal verifica- 
tion approaches have been proposed. Typical examples include constraint solving based 
approaches [17,26,30,31,51], optimization based approaches [10,13,15,16,40,61,67,68],
and program analysis based approaches [2,3,18,20,37-39,55-57,62-64,69].

Existing techniques for quantized DNNs are mostly based on constraint solving, 
in particular, SAT/SMT solving [12,33,45,46]. Following this line, verification of 
BNNs with ternary weights [28,48] and quantized DNNs with multiple bits [7,22,24] 
were also studied. Recently, the SMT-based framework Marabou for real-numbered 
DNNs [31] has also been extended to support BNNs [1]. 


Quantitative Analysis of DNNs. Compared to qualitative analysis, quantitative analysis
of neural networks is currently very limited. Two sampling-based approaches were
proposed to certify the robustness for both DNNs and BNNs [5,65]. Yang et al. [69] 
proposed a spurious region-guided refinement approach for real-numbered DNN verifi- 
cation, claiming to be the first work of the quantitative robustness verification of DNNs 
with soundness guarantees. 

Following the SAT-based qualitative analysis of BNNs [45,46], SAT-based quan- 
titative analysis approaches were also proposed [6,21,47]. In particular, approximate 
SAT model-counting solvers are utilized. Shih et al. [54] also proposed a BDD-based 
approach to tackle BNNs, similar to our work in spirit. However, our approach is able 
to handle BNNs of considerably larger sizes than their learning-based method. 


Interpretability of DNNs. Though interpretability of DNNs is crucial for explaining
predictions, it is very challenging to tackle due to the black-box nature of DNNs.
There is a large body of work on the interpretability of DNNs (cf. [25,43] for a survey).
Almost all the existing approaches are heuristic-based and restricted to finding explanations
that are local in an input region. Some of them tackle the interpretability of DNNs
by learning an interpretable model, such as binary decision trees [19,70] or finite-state
automata [66]. In contrast to ours, they target DNNs and only approximate the original
model in the input region. The BDD-based approach [54] mentioned above has been
used to compute the PI-explanation, but essential features were not considered therein.


7 Conclusion 


In this paper, we have proposed a novel BDD-based framework for quantitative verifica- 
tion of BNNs. We implemented the framework as a prototype tool BDD4BNN and con- 
ducted extensive experiments on 12 BNN models with varying sizes and input regions. 
Experimental results demonstrated that BDD4BNN is more scalable than the existing 
BDD-learning based approach, and significantly more efficient and accurate than the 
existing SAT-based approach NPAQ. This work represents the first, but a key, step of the 
long-term program to develop an efficient and scalable BDD-based quantitative analysis 
framework for BNNs. 
Abstract. State-of-the-art program-analysis techniques are not yet able
to effectively verify safety properties of heterogeneous systems, that is, 
systems with components implemented using diverse technologies. This 
shortcoming is pinpointed by programs invoking neural networks despite 
their acclaimed role as innovation drivers across many application areas. 
In this paper, we embark on the verification of system-level properties for 
systems characterized by interaction between programs and neural net- 
works. Our technique provides a tight two-way integration of a program 
and a neural-network analysis and is formalized in a general framework 
based on abstract interpretation. We evaluate its effectiveness on 26 vari- 
ants of a widely used, restricted autonomous-driving benchmark. 


1 Introduction 


Software is becoming increasingly heterogeneous. In other words, it consists of 
more and more diverse software components, implemented using different tech- 
nologies such as neural networks, smart contracts, or web services. Here, we 
focus on programs invoking neural networks, in response to their prominent role 
in many upcoming application areas. Examples from the forefront of innovation 
include a controller of a self-driving car that interacts with a neural network 
identifying street signs [43,48], a banking system that consults a neural network 
for credit screening [3], or a health-insurance system that relies on a neural net- 
work to predict people’s health needs [51]. There are growing concerns regarding 
the effects of integrating such heterogeneous technologies [40]. 

Despite these software advances, state-of-the-art program-analysis techniques 
cannot yet effectively reason across heterogeneous components. In fact, program 
analyses today focus on homogeneous units of software in isolation; for instance,
to check the robustness of a neural network (e.g., [37,27,36,66,65,64,57,41]), 
or safety of a program invoking a neural network while—conservatively, but 
imprecisely—treating the neural network as if it could return arbitrary values. 
This is a fundamental limitation of prevalent program-analysis techniques, and 
as a result, we cannot effectively analyze the interaction between diverse com- 
ponents of a heterogeneous system to check system properties. 


Many properties of heterogeneous systems depend on components correctly 
interacting with each other. For instance, consider a program that controls the 
acceleration of an autonomous vehicle by invoking a neural network with the 
current direction, speed, and LiDAR image of the vehicle’s surroundings. One 
might want to verify that the vehicle’s speed never exceeds a given bound. Even 
such a seemingly simple property is challenging to verify automatically due to 
the mutual dependencies between the two components. On the one hand, the 
current vehicle direction and speed determine the feasible inputs to the neural 
network. On the other hand, the output of the neural network controls the 
vehicle acceleration, and thereby, the speed. To infer bounds on the speed (and 
ultimately prove the property), an automated analysis should therefore analyze 
how the two components interact. 


Our approach. In this paper, we make the first step in verifying safety of het- 
erogeneous systems, and more specifically, of programs invoking neural networks. 
Existing work on verification of neural networks has either focused on the net- 
work itself (e.g., with respect to robustness) or on models (e.g., expressed using 
differential equations) that invoke the network, for example as part of a hybrid 
system [24,59]. In contrast, our approach is designed for verifying safety of a C 
(or ultimately LLVM) program interacting with the network. In comparison to 
models, C programs are much more low-level and general, and therefore require 
an intricate combination of program and neural-network analyses. 


More specifically, our approach proposes a symbiotic combination of a pro- 
gram and a neural-network analysis, both of which are based on abstract inter- 
pretation [18]. By treating the neural-network analysis as a specialized abstract 
domain of the program analyzer, we are able to use inferred invariants for the 
neural network to check system properties in the surrounding program. In other 
words, the program analysis becomes aware of the network’s computation. For 
this reason, we also refer to the overall approach as neuro-aware program anal- 
ysis. In fact, the program and neural-network analyses are co-dependent. The 
former infers sets of feasible inputs to the neural network, whereas the latter 
determines its possible outputs given the inferred inputs. Knowing the possible 
neural-network outputs, in turn, enables proving system safety. 


We evaluate our approach on 26 variants of RACETRACK, a benchmark from 
related work that originates in AI autonomous decision making [4,5,32,46,52,53]. 
RACETRACK is about the problem of navigating a vehicle on a discrete map to 
a goal location without crashing into any obstacles. The vehicle acceleration (in 
discrete directions) is determined by a neural network, which is invoked by a 
controller responsible for actually moving the vehicle. In Sect. 4, we show the 


Automated Safety Verification of Programs Invoking Neural Networks 203 


effectiveness of our approach in verifying goal reachability and crash avoidance 
for 26 RACETRACK variants of varying complexity. These variants constitute a 
diverse set of verification tasks that differ both in the neural network itself and 
in how and for what purpose the program invokes the neural network. 

Despite our evaluation being focused on this setting, the paper’s contribution 
should not be mistaken as being about RACETRACK verification. Instead, it is 
about neuro-aware program analysis of heterogeneous systems for autonomous 
decision making. While RACETRACK is a substantially simplified blueprint for 
the autonomous-driving context, it features the crucial co-dependent program 
architecture that is characteristic across the entire domain. 

Contributions. Overall, we make the following contributions: 


1. We present the first symbiotic combination of program and neural-network 
analyses for verifying safety of heterogeneous systems. 

2. We formalize neuro-aware program analysis in a general framework that uses 
specialized abstract domains. 

3. We evaluate the effectiveness of our approach on 26 variants of a widely used, 
restricted autonomous-driving benchmark. 


2 Overview 


We now illustrate neuro-aware program analysis on a high level by describing 
the challenges in verifying safety for a variant of the RACETRACK benchmark. 
This variant serves as our running example for the class of programs that invoke 
one or more neural networks to perform a computation affecting program safety. 

In general, RACETRACK is a heterogeneous system that simulates the prob- 
lem of navigating a vehicle to a goal location on a discrete map without crashing 
into any obstacles. It consists of a neural network, which predicts the vehi- 
cle acceleration toward discrete directions, and a controller (implemented in C) 
that actually moves the vehicle on the map. Alg. 1 shows pseudo-code for our 
running example, a variant of RACETRACK that incorporates additional non- 
deterministic noise to make verification harder. 

Line 1 non-deterministically selects a state from the map as the currentState, 
and line 2 assumes it is a start state for the vehicle, i.e., it is neither a goal nor 
an obstacle. On line 3, we initialize the result of navigating the vehicle as stuck, 
i.e., the vehicle neither crashes nor does it reach a goal. The loop on line 5 
iterates until either a predefined number of steps N is reached or the vehicle is 
no longer stuck (i.e., crashed or at a goal state). The if-statement on line 6 adds 
non-determinism to the controller by either zeroing the vehicle acceleration or 
invoking the neural network (NN) to make a prediction. Such non-deterministic 
noise illustrates one type of variant we created to make the verification task more 
difficult (see Sect. 4.1 for more details on other variants used in our evaluation). 
Line 10 moves the vehicle to a new currentState according to acceleration, and 
the if-statement on line 11 determines whether the vehicle has crashed or reached 
a goal. The assertion on line 16 denotes the system properties of goal reachability
and crash avoidance. In case this assertion does not hold but we do prove the
result to be stuck, then we have only verified crash avoidance.

Algorithm 1: An example RACETRACK variant.

 1  currentState ← ⋆
 2  assume IsStartState(currentState)
 3  result ← stuck
 4  i ← 0
 5  while i < N and result = stuck do
 6      if ⋆ then
 7          acceleration ← 0
 8      else
 9          acceleration ← NN(currentState)
10      currentState ← Move(currentState, acceleration)
11      if IsCrash(currentState) then
12          result ← crash
13      else if IsGoal(currentState) then
14          result ← goal
15      i ← i + 1
16  assert result = goal

Note that these are safety, and not liveness, properties due to the bounded 
number of loop iterations (line 5)—N is 50 in our evaluation, thus making 
bounded model checking [8,15] intractable. 
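
To make the control flow of Alg. 1 concrete, here is a minimal executable sketch in Python. It simulates single concrete runs and is not a verifier. Everything about the map is hypothetical: a one-dimensional track with states (position, velocity), a wall below position 0, goal cells from position 10 on, and a stand-in `nn` function instead of a trained network. The non-deterministic choice on line 6 is resolved by a caller-supplied `noise` function.

```python
import random

N = 50         # loop bound, as in the evaluation
GOAL_POS = 10  # hypothetical: positions >= 10 are goal cells


def is_crash(state):
    return state[0] < 0            # hypothetical: a wall left of position 0


def is_goal(state):
    return state[0] >= GOAL_POS


def is_start_state(state):
    return not is_crash(state) and not is_goal(state)


def nn(state):
    """Stand-in for the network: accelerate until the velocity reaches 1."""
    _, velocity = state
    return 1 if velocity < 1 else 0


def move(state, acceleration):
    position, velocity = state
    velocity += acceleration
    return (position + velocity, velocity)


def run_controller(start, noise):
    """One concrete run of Alg. 1; `noise()` resolves the choice on line 6."""
    current_state = start
    assert is_start_state(current_state)
    result = "stuck"
    i = 0
    while i < N and result == "stuck":
        if noise():
            acceleration = 0
        else:
            acceleration = nn(current_state)
        current_state = move(current_state, acceleration)
        if is_crash(current_state):
            result = "crash"
        elif is_goal(current_state):
            result = "goal"
        i += 1
    return result


print(run_controller((3, 0), noise=lambda: False))  # goal
print(run_controller((3, 0), noise=lambda: True))   # stuck
print(run_controller((3, 0), noise=lambda: random.random() < 0.5))
```

A single run only witnesses one resolution of the noise. Proving the assertion on line 16 requires reasoning about all resolutions at once, which is what the analysis described in the rest of the paper targets.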

Challenges. Verifying safety of this heterogeneous system with state-of-the-art 
program-analysis techniques, such as abstract interpretation, is a challenging 
endeavor. 

When considering the controller in isolation, the analysis is sound if it assumes
that the neural network may return any output (⊤). More specifically,
the abstract interpreter can ignore the call to the neural network and simply 
havoc its return value (i.e., consider a non-deterministic value). In our running 
example, this means that any vehicle acceleration is possible from the perspec- 
tive of the controller analysis. Therefore, it becomes infeasible to prove a system 
property such as crash avoidance. In fact, in Sect. 4, we show this to be the case 
even with the most precise controller analysis. 

On the other hand, when considering the neural network in isolation, the
analysis must assume that any input is possible (⊤) even though this is not
necessarily the case in the context of the controller. More importantly, without
analyzing the controller, it becomes infeasible to prove properties about the
entire system, as opposed to properties of the neural network, such as robustness.

Our approach. To address these issues, our approach tightly combines the controller
and neural-network analyses in a two-way integration based on abstract
interpretation.

In general, an abstract interpreter infers invariants at each program state 
and verifies safety of an asserted property when it is implied by the invariant 
inferred in its pre-state. In the presence of loops, as in RACETRACK (line 5 in 
Alg. 1), inference is performed for a number of iterations in order to reach a
fixpoint, that is, infer invariants at each program state that do not change when 
performing additional loop iterations. 
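
As an illustration of fixpoint inference over a loop, the following Python sketch computes an interval invariant for the counter of a loop shaped like the one in Alg. 1 (`i = 0; while i < n: i = i + 1`). It is a textbook-style interval analysis with widening followed by one refinement step, written by us for illustration; it is not how CRAB is implemented, and all names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def join(self, other: "Interval") -> "Interval":
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def widen(self, other: "Interval") -> "Interval":
        # Any bound that is still growing jumps straight to infinity.
        lo = self.lo if other.lo >= self.lo else -math.inf
        hi = self.hi if other.hi <= self.hi else math.inf
        return Interval(lo, hi)


def loop_head_invariant(n: int, widen_after: int = 3) -> Interval:
    """Invariant for i at the head of: i = 0; while i < n: i = i + 1."""
    init = Interval(0, 0)

    def step(inv: Interval) -> Interval:
        # Abstractly execute one iteration: filter by i < n, then add 1.
        guard_hi = min(inv.hi, n - 1)
        if guard_hi < inv.lo:          # the loop body is unreachable
            return init
        return init.join(Interval(inv.lo + 1, guard_hi + 1))

    inv, iteration = init, 0
    while True:
        nxt = step(inv)
        nxt = inv.widen(nxt) if iteration >= widen_after else inv.join(nxt)
        if nxt == inv:                 # fixpoint: another iteration changes nothing
            break
        inv, iteration = nxt, iteration + 1
    return step(inv)                   # one refinement step recovers the upper bound


print(loop_head_invariant(50))  # Interval(lo=0, hi=50)
```

Widening makes the iteration terminate after a handful of steps instead of 50, at the price of temporarily losing the upper bound, which the final refinement step restores.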

For our running example, to compute the fixpoint of the main loop, the con- 
troller analysis invokes the neural-network analysis instead of simply abstracting 
the call to the neural network by havocking its return value. The invariants in- 
ferred by the controller analysis in the pre-state of the call to the network are 
passed to the neural-network analysis; they are used to restrict the input space 
of the neural network. In turn, the invariants that are inferred by the neural- 
network analysis are returned to the controller analysis to restrict the output 
space. This exchange of verification results at analysis time significantly improves 
precision. By making the program analysis aware of the network’s computation, 
neuro-aware program analysis is able to prove challenging safety properties of 
the entire system. 

Our implementation combines off-the-shelf, state-of-the-art abstract interpreters,
namely, CRAB [34] for the controller analysis and DEEPSYMBOL [41]
or ERAN [27,56,57] for the neural-network analysis. CRAB is a state-of-the-art
analyzer for checking safety properties of LLVM bitcode programs. Its modular 
high-level architecture is similar to many other abstract interpreters, such as 
Astrée [9], Clousot [26], and Infer [11], and it supports a wide range of different 
abstract domains, such as Intervals [17], Polyhedra [19], and Boxes [33]. Special- 
ized neural-network analyzers, such as DEEPSYMBOL or ERAN, have only very 
recently been developed to deal with the unique challenges of precisely check- 
ing robustness of neural networks; for instance, the challenge of handling the 
excessive number of “branches” induced by cascades of ReLU activations. 

The technical details of this combination are presented in the following section.
Note, however, that our technical framework does not prescribe a neural-network
analysis that is necessarily based on abstract interpretation. Specifically,
it could integrate any sound analysis that, given a set of (symbolic) input states,
produces a set of output states over-approximating the return value of the neural
network. We also discuss how our approach may integrate reasoning about other
complex components, beyond neural networks. Our program analysis is also not
inherently tied to CRAB, but could be performed by other abstract interpreters
that use the same high-level architecture, such as Astrée [9].

The RACETRACK map on the right, which is borrowed from related work [5,32], shows
the verification results achieved by our approach when combining CRAB and
DEEPSYMBOL. Gray cells marked with ‘x’ denote obstacles, and yellow cells
marked with ‘g’ denote goal locations. Recall from Alg. 1 that we can consider
any cell that is neither an obstacle nor a goal as a possible start location.

In our evaluation, we run a separate analysis for each possible start state to 
identify all start locations from which the vehicle is guaranteed to reach a goal; in 
other words, the analysis tries to prove that result = goal holds (line 16 of Alg. 1) 
for each possible start location. Note that verifying a single start state already 
constitutes a challenging verification problem since, due to noise, the number of 
reachable states grows exponentially in the number of loop iterations (the vehicle 
can navigate to any feasible position). This setting of one start state is common 
in many reinforcement-learning environments, e.g., Atari games, Procgen [16], 
OpenAI Gym MiniGrid, etc.

Maps like the above are used throughout the paper to display the outcome of 
a verification process per cell. We color locations for which the process succeeds 
green in all shown maps. Similarly, we color states from which the vehicle might 
crash into an obstacle red; i.e., one or more states reachable from the start state 
may lead to a crash, and the analysis is not able to show that result ≠ crash
holds before line 16. Finally, states from which the vehicle is guaranteed not to
crash but might not reach a goal are colored in blue; i.e., the analysis is able to
show that result ≠ crash holds before line 16, but it is not able to show that
result ≠ stuck also holds.

As shown in the map, our approach is effective in verifying goal reacha- 
bility and crash avoidance for the majority of start locations. Moreover, the 
verification results are almost identical when combining CRAB with a different 
neural-network analyzer, namely ERAN (see Sect. 4). Note that, since the anal- 
ysis considers individual start states, the map may show a red start state that 
is surrounded by green start states. One explanation for this is that the vehicle 
never enters the red state from the surrounding green states or that it only en- 
ters the red state with a “safe” velocity and direction—imagine that the vehicle 
velocity when starting from the red state is always 2, whereas when entering it 
from green states, the velocity is always less. In general, whether a trajectory is 
safe largely depends on the neural-network behavior, which can be brittle. 


3 Approach 


As we discussed on a high level, our approach symbiotically combines an existing
program analysis (PA) with a neural-network analysis (NNA). The result is
a neuro-aware program analysis (NPA) that allows for precisely analyzing a
program that invokes neural networks (see Fig. 1). In the following, we focus on
a single network to keep the presentation simple. As shown in Fig. 1, the two
existing analyses are extended to pass information both from PA to NNA and
back, as marked in the diagram.

In the following, we describe neuro-aware program analysis in more detail
and elaborate on how the program analysis drives the neural-network analysis
to verify safety properties of the containing heterogeneous system. Since the
program analysis drives the neural-network analysis, we will explain the analysis
in a top-down fashion by focusing on the program analysis before going into the
details of the network analysis. In other words, our description of the program
analysis assumes that we have a network analysis that over-approximates the
behavior of the neural network.


Figure 1: Overview of neuro-aware program analysis. (Diagram: the Program Analyzer (PA) and the Neural Network Analyzer (NNA) are combined into the Neuro-Aware Program Analyzer (NPA), which reports Safe / Unsafe.)


3.1 Neuro-Aware Program Analysis 


For our presentation, we assume imperative programs $P$ with standard constructs,
such as loops, function calls, arithmetic, and pointer operations (our
implementation targets LLVM bitcode). In addition, we assume a special function
call $o := nn(i_1, \ldots, i_n)$ that calls a neural network with input parameters
$i_1, \ldots, i_n$ and returns the result of querying the network in return value $o$. We
also assume that the query does not have side effects on the program state. We
denote programs $P$ augmented with special calls to neural networks as $P_{nn}$.

We assume an abstract domain $D$ consisting of a set of abstract elements $d \in D$.
Domain $D$ is equipped with the usual binary operators $\langle \sqsubseteq, \sqcup, \sqcap, \nabla, \triangle \rangle$, where
the ordering between elements is given by $\sqsubseteq$. $\bot_D$ represents the smallest domain
element and $\top_D$ the largest (smallest and largest relative to the ordering imposed
by $\sqsubseteq$). The least upper bound (greatest lower bound) operator is denoted by $\sqcup$
($\sqcap$). As usual, if the abstract domain is not finite or the number of elements is
too large, then we also assume the domain to be equipped with widening ($\nabla$)
and narrowing ($\triangle$) operators to ensure termination of the fixpoint computation.
Moreover, we assume the abstract $forget : D \times V \mapsto D$ operation that removes
a set of variables from the abstract state, and its dual $project : D \times V \mapsto D$
that projects the abstract state onto a set of variables. Finally, we assume the
semantics function $[\![\cdot]\!] : P \mapsto D \mapsto D$ that, given a pre-state, computes the
abstract semantics of a program to obtain its post-state; it does so recursively,
by induction over the syntax of the program. We do not require that there exists
a Galois connection [18] between the abstract domain $D$ and the concrete domain
$C$. The only requirement is that $D$ over-approximates $C$, i.e., $[\![\cdot]\!]^C \subseteq \gamma \circ [\![\cdot]\!] \circ \alpha$,
where $[\![\cdot]\!]^C$ is the concrete semantics and $\gamma : D \mapsto C$ and $\alpha : C \mapsto D$ are the
concretization and abstraction functions, respectively.


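We can then trivially define $\widetilde{[\![\cdot]\!]} : P_{nn} \mapsto D \mapsto D$ to deal with $P_{nn}$ as follows:

$$\widetilde{[\![Cmd]\!]}(d) = \begin{cases} \widetilde{[\![o := nn(i_1, \ldots, i_n)]\!]}(d) & \text{if } Cmd \equiv o := nn(i_1, \ldots, i_n) \\ [\![Cmd]\!](d) & \text{otherwise} \end{cases}$$

$$\widetilde{[\![o := nn(i_1, \ldots, i_n)]\!]}(d) = \begin{cases} \bot_D & \text{if } d = \bot_D \\ forget(d, o) & \text{otherwise} \end{cases}$$

However, this definition of $\widetilde{[\![\cdot]\!]}$ is not very useful since it conservatively approximates
the neural network by havocking its return value $o$.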

To obtain a more precise approximation, we can integrate a designated neural-network
analysis. Specifically, we view the neural-network analysis as another
abstract domain $D_{nn}$, where, in practice, we do not require any other operation
from $D_{nn}$ except the transfer function for $o := nn(i_1, \ldots, i_n)$ that soundly
approximates the semantics of the neural network (see Sect. 3.2 for more details):

$$\widetilde{[\![o := nn(i_1, \ldots, i_n)]\!]}(d) = \begin{cases} \bot_D & \text{if } d = \bot_D \\ \begin{array}{l} \textbf{let } d_{nn} = convert(project(d, i_1, \ldots, i_n)) \textbf{ in} \\ \textbf{let } d'_{nn} = [\![o := nn(i_1, \ldots, i_n)]\!]_{D_{nn}}(d_{nn}) \textbf{ in} \\ forget(d, o) \sqcap convert^{-1}(d'_{nn}) \end{array} & \text{otherwise} \end{cases}$$


Intuitively, this more precise transfer function performs the following steps
(unless $d$ is $\bot_D$). First, it converts from $D$ to $D_{nn}$ to invoke the transfer function
of $D_{nn}$ on the converted value $d_{nn}$. It then havocs the return value $o$ and conjoins
the inferred return value after converting $d'_{nn}$ back to $D$. In the above definition,
functions $convert : D \mapsto D_{nn}$ and $convert^{-1} : D_{nn} \mapsto D$ convert from one
abstract domain ($D$) to the other ($D_{nn}$) and back. We allow for conversions to
result in loss of precision, that is, $\forall x \in D \cdot x \sqsubseteq convert^{-1}(convert(x))$.
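The steps of this transfer function can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the implementation in CRAB: $D$ is taken to be a map from variables to integer intervals, $D_{nn}$ is a finite set of concrete points, and all names (`BOTTOM`, `convert_inv`, `meet`, `toy_nn_abstract`) are hypothetical.

```python
from itertools import product

BOTTOM = None  # hypothetical encoding of the bottom element of D


def project(d, variables):
    """Keep only the given variables of an interval state."""
    return {v: d[v] for v in variables}


def forget(d, var):
    """Havoc var: drop everything that is known about it."""
    return {v: iv for v, iv in d.items() if v != var}


def convert(d, inputs):
    """D -> D_nn: enumerate the integer points of the input box."""
    ranges = [range(d[v][0], d[v][1] + 1) for v in inputs]
    return set(product(*ranges))


def convert_inv(outputs, out_var):
    """D_nn -> D: tightest interval around the outputs (may lose precision)."""
    return {out_var: (min(outputs), max(outputs))}


def meet(d1, d2):
    """Greatest lower bound; a variable missing from a state is unconstrained."""
    result = dict(d1)
    for v, (lo, hi) in d2.items():
        if v in result:
            lo, hi = max(lo, result[v][0]), min(hi, result[v][1])
        if lo > hi:
            return BOTTOM
        result[v] = (lo, hi)
    return result


def nn_transfer(d, out_var, inputs, nn_abstract):
    """Transfer function for `out_var := nn(inputs)` following the definition above."""
    if d is BOTTOM:
        return BOTTOM
    d_nn = convert(project(d, inputs), inputs)
    d_nn_post = nn_abstract(d_nn)
    return meet(forget(d, out_var), convert_inv(d_nn_post, out_var))


def toy_nn_abstract(points):
    """Stand-in for a network analysis: move 1 for small x, move 7 otherwise."""
    return {1 if x < 5 else 7 for (x,) in points}


pre = {"x": (3, 6), "o": (0, 0)}
post = nn_transfer(pre, "o", ["x"], toy_nn_abstract)
print(post)  # {'x': (3, 6), 'o': (1, 7)}
```

In this toy run, the stand-in analysis returns only the moves 1 and 7, yet the interval state records `o` as `(1, 7)`, which also admits the moves 2 to 6. This is an instance of the permitted precision loss in $convert^{-1}$.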


It is important to realize here that the implementation of functions $convert$
and $convert^{-1}$, however precise, may still trigger a fatal loss of precision. After
all, the abstract domains $D$ and $D_{nn}$ must also be expressive and precise enough
to capture the converted values. For example, assume that, in a given program,
the function call $o := nn(i_1, \ldots, i_n)$ invokes a neural network to obtain the next
move of a vehicle (encoded as a value from 0 to 7). Suppose the abstract return
value $d'_{nn}$ is the set of moves {1, 7}. In this case, given a domain $D$ that cannot
express these two moves as disjuncts, the implementation of function $convert^{-1}$
has no choice but to abstract more coarsely; for instance, by expressing the
return value as the interval [1, 7]. Depending on the program, this may be
too imprecise to prove safety. This happened to be the case for the RACETRACK
variants we analyzed in Sect. 4, which is why we chose to use a disjunctive
domain for the analysis; more specifically, we use Boxes [33], which allows us to
track Boolean combinations of intervals.

Nevertheless, the considerations and approach described so far are still not
precise enough for verifying the safety properties in the variants of RACETRACK 
that we consider. This is because the controller makes extensive use of (multi- 
dimensional) arrays, whose accesses are not handled precisely by the analysis. 
However, we observed that these arrays are initialized at the beginning of the 
system execution (for instance, to store maps indexed by x-y coordinates), and 
after initialization, they are no longer modified. Handling such array reads pre- 
cisely is crucial in our context since over-approximation could conservatively 
indicate that the vehicle may crash into an obstacle. 


Common array domains that summarize multiple elements using one or a 
small number of abstract elements fail to provide the needed precision. Even a 
fully expanded array domain [9] that separately tracks each array element loses 
precision if the index of an array read does not refer to a single array element; 
in such cases, the join of all overlapping elements will be returned. In addition, 
the Clang compiler—used to produce the LLVM bitcode that CRAB analyzes— 
desugars multi-dimensional arrays into single-dimensional arrays. This results in 
additional arithmetic operations (in particular, multiplications) for indexing the 
elements; these are also challenging to analyze precisely. 


Interestingly, to address these challenges, we follow the same approach as for
neural networks, namely, we introduce a designated and very precise
analysis to handle reads from these pre-initialized arrays. More formally, we
introduce a new statement to capture such reads, $o := ar(i_1, \ldots, i_n)$, where $i_k$ is
the index for the $k$-th dimension of an $n$-dimensional array. Note that this avoids
index conversions for multi-dimensional arrays since indices of each dimension
are provided explicitly. Moreover, it is structurally very similar to the nn(...)
statement we introduced earlier. In particular, the specialized transfer function
for $D$ differs only in the two conversion functions and the specialized transfer
function $\llbracket \cdot \rrbracket_{D_{ar}}$:

$$
\widetilde{\llbracket o := ar(i_1, \ldots, i_n) \rrbracket}(d) =
\begin{cases}
\bot_D & \text{if } d = \bot_D \\
\begin{array}{l}
\text{let } d_{ar} = convert_{ar}(project(d, i_1, \ldots, i_n)) \text{ in} \\
\text{let } d'_{ar} = \llbracket o := ar(i_1, \ldots, i_n) \rrbracket_{D_{ar}}(d_{ar}) \text{ in} \\
forget(d, o) \sqcap convert_{ar}^{-1}(d'_{ar})
\end{array} & \text{otherwise}
\end{cases}
$$


To keep this transfer function simple, its input is a set of concrete indices and
its output a set of concrete values that are retrieved by looking up the indexed
elements in the array (after initialization). This makes it necessary for $convert_{ar}$
to concretize the abstract inputs to a disjunction of (concrete) tuples $(i_1, \ldots, i_n)$
for the read indices. Similarly, $convert_{ar}^{-1}$ converts the disjunction of (concrete)
values back to an element of domain $D$.


Let us consider the concrete example in Fig. 2 to illustrate this more clearly.
Line 1 initializes an array that is never again written to. On line 2, a
non-deterministic value is assigned to variable idx, and the subsequent
assume-statement constrains its value to be in the interval from 0 to 6. The assertion
on line 5 checks that element elem, which is read from the array (on line 4),
is less than 13.

    1 int arr[] = {0, 1, 1, 2, 3, 5, 8, 13};
    2 int idx = *;
    3 assume(0 <= idx && idx <= 6);
    4 int elem = arr[idx];
    5 assert(elem < 13);

Figure 2: Example illustrating the specialized array domain.

Let us assume that we want to analyze the code by combining
the numerical Intervals domain with our array domain $D_{ar}$; in other words, we
assume $D$ is instantiated with Intervals. In the pre-state of the array read, the
analysis infers that the abstract value for idx is interval [0,6]. When computing
the post-state for the read operation, the analysis converts this interval to the
concrete set of indices {0,1,2,3,4,5,6} via $convert_{ar}$. The transfer function for
the array domain then looks up the (concrete) elements for each index to obtain
the (concrete) set {0,1,2,3,5,8}. Before returning this set to the Intervals
domain, the analysis applies $convert_{ar}^{-1}$ to obtain the abstract value [0,8]. This
post-state allows the numerical domain to prove the assertion.
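The walk-through of Fig. 2 can be replayed with a minimal Python sketch. The function names `convert_ar`, `array_read`, and `convert_ar_inv` are illustrative stand-ins for the two conversion functions and the array-domain transfer function; intervals are modeled as plain `(lo, hi)` pairs, which is an assumption of this sketch.

```python
def convert_ar(interval):
    """Concretize an interval of indices into the set of concrete indices."""
    lo, hi = interval
    return set(range(lo, hi + 1))


def array_read(array, indices):
    """Array-domain transfer function: look up every possible index."""
    return {array[i] for i in indices}


def convert_ar_inv(values):
    """Abstract a set of concrete values back into the tightest interval."""
    return (min(values), max(values))


arr = [0, 1, 1, 2, 3, 5, 8, 13]  # line 1 of Fig. 2
idx = (0, 6)                     # abstract value of idx after the assume

indices = convert_ar(idx)
values = array_read(arr, indices)
elem = convert_ar_inv(values)

print(sorted(indices))  # [0, 1, 2, 3, 4, 5, 6]
print(sorted(values))   # [0, 1, 2, 3, 5, 8]
print(elem)             # (0, 8)
print(elem[1] < 13)     # True: the assertion on line 5 holds
```

By contrast, summarizing the whole array with a single interval would yield `(0, 13)`, which is too coarse to establish `elem < 13`.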

Note that this array domain is not specific to controllers such as the one used 
in our RACETRACK variants. In fact, one could consider using it to more precisely 
analyze other programs with complex arrays that are initialized at runtime; a 
concrete example would be high-performance hash functions that often rely on 
fixed lookup tables. 

Even more generally, the domains we sketched above suggest that our ap- 
proach is also applicable to other scenarios; for instance, when a piece of code is 
too challenging to handle by a generic program analysis, and a simple summary 
or specification would result in unacceptable loss of precision. 


3.2 Neural-Network Analysis 


AI² [27] was the first tool and technique for verifying robustness of neural
networks using abstract interpretation. ERAN is a successor of AI²; it incorporates
specialized transfer functions and abstract domains, such as DeepZ [56] (a variant
of Zonotopes [28]) and DeepPoly [57] (a variant of Polyhedra [19]). Meanwhile,
DEEPSYMBOL [41] extended AI² with a novel symbolic-propagation technique.
In the following, we first provide an overview of the techniques in ERAN and
DEEPSYMBOL. Then, we describe how their domains can be used to implement
the specialized transfer function from $D_{nn}$ that was introduced in Sect. 3.1. On
a high level, even though we are not concerned with robustness properties in 
this work, we re-purpose components of these existing tools to effectively check 
safety properties of heterogeneous systems that use neural networks. 

The main goal behind verifying robustness of a neural network is to provide
guarantees about whether it is susceptible to adversarial attacks [31,12,50,44].
Such attacks slightly perturb an original input (e.g., an image) that is classified
correctly by the network (e.g., as a dog) to obtain an adversarial input that
is classified differently (e.g., as a cat). Given a concrete input (e.g., an image),
existing tools detect such local-robustness violations by expressing the set of all
perturbed inputs (within a bounded distance from the original according to a
metric, such as $L_\infty$ [35]) and “executing” the neural network with this set of
inputs to obtain a set of outcomes (or labels). The network is considered to be
locally robust if there is at most one possible outcome.

Existing techniques use abstract domains to express sets of inputs and out- 
puts, and define specialized transfer functions to capture the operations (e.g., 
affine transforms and ReLUs) that are required for executing neural networks. 
For instance, ERAN uses the DeepPoly [57] domain that captures polyhedral
constraints and incorporates custom transfer functions for affine transforms, Re- 
LUs, and other common neural-network operations. DEEPSYMBOL propagates 
symbolic information on top of abstract domains [65,41] to improve its precision. 
The key insight is that neural networks make extensive use of operations that ap- 
ply linear combinations of arguments, and symbolic propagation is able to track 
linear-equality relations between variables (e.g., activation values of neurons). 

Both ERAN and DEEPSYMBOL have the following in common: they define an
abstract semantics for reasoning about neural-network operations and for computing
an abstract set of outcomes from a set of inputs. We leverage this semantics
to implement the specialized transfer function
$\llbracket o := nn(i_1, \ldots, i_n) \rrbracket_{D_{nn}}(d_{nn})$ from Sect. 3.1.
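To make the idea of an abstract semantics for neural-network operations concrete, the following Python sketch propagates plain intervals through an affine layer and a ReLU. It is a deliberately simple illustration and not the DeepPoly or symbolic-propagation analysis of ERAN and DEEPSYMBOL; the two-layer network and its weights are invented for the example.

```python
def affine(box, weights, bias):
    """Sound interval transfer function for y = W x + b."""
    out = []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, (x_lo, x_hi) in zip(row, box):
            # A non-negative weight keeps the bounds in place; a negative one swaps them.
            lo += w * x_lo if w >= 0 else w * x_hi
            hi += w * x_hi if w >= 0 else w * x_lo
        out.append((lo, hi))
    return out


def relu(box):
    """Interval transfer function for the element-wise ReLU."""
    return [(max(0.0, lo), max(0.0, hi)) for lo, hi in box]


# Hypothetical network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output.
W1, b1 = [[1.0, -1.0], [1.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0]], [0.5]

inputs = [(0.0, 1.0), (0.0, 1.0)]
hidden = relu(affine(inputs, W1, b1))
output = affine(hidden, W2, b2)

print(hidden)  # [(0.0, 1.0), (0.0, 2.0)]
print(output)  # [(-1.5, 1.5)]
```

For this particular toy network, the output can be checked by hand to never exceed 0.5 on the given input box, so the interval result is sound but loose. Losing the linear relation between the two hidden neurons causes this imprecision, which is the kind of information that relational domains and symbolic propagation aim to retain.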


4 Experimental Evaluation 


To evaluate our technique, we aim to answer the following research questions: 


RQ1: How effective is our technique in verifying goal reachability and crash 
avoidance? 

RQ2: How does the quality of the neural network affect the verification results? 

RQ3: How does a more complex benchmark affect the verification results? 

RQ4: How does the neural-network analyzer affect the verification results? 


4.1 Benchmarks 


We run our experiments on variants of RACETRACK, which is a popular bench- 
mark in the AI community [4,5,32,46,52,53] and implements the pseudo-code 
from Alg. 1 in C (see Sect. 2 for a high-level overview of the benchmark). 

The RACETRACK code is significantly more complicated than the pseudo-code
in Alg. 1 would suggest; more specifically, it consists of around 400 lines of
C code and invokes a four-layer fully connected neural network (with 14 inputs,
9 outputs, and 64 neurons per hidden layer, using ReLU activation functions).
To name a few sources of complexity, the currentState does not just denote a
single value, but rather the position of the vehicle on the map, the magnitude
and direction of its velocity, its distance to goal locations, and its distance to
obstacles. As another example, the MOVE function runs the trajectory of the
vehicle from the old to the new state while determining whether there are any
obstacles in between.

For simplicity, the code does not use floating-point numbers to represent 
variables, such as position, velocity, acceleration, and distance. Therefore, the 
program analyzer does not need to reason about floating-point numbers, which 
is difficult for CRAB and most other tools. However, this does not semantically 
affect the neural network or its analysis, both of which do use floats. An inter- 
face layer converts the input features, tracked as integers in the controller, to 
normalized floats for the neural-network analysis. The output from the neural- 
network analysis is a set of floating-point intervals, which are logically mapped 
to integers representing discrete possible actions at a particular state. 

We evaluate our approach on 26 variants of RACETRACK, which differ in the 
following aspects of the analyzed program or neural network. 

Maps. We adopt three RACETRACK maps of varying complexity from related 
work [5,32], namely barto-small (BS) of size 12 × 35, barto-big (BB) of size 30 × 33,
and ring (R) of size 45 × 50. The size of a map is measured in terms of
its dimensions (i.e., width and height). The map affects not only the program 
behavior, but also the neural network that is invoked. The latter is due to the 
fact that we train custom networks for different maps. 

Neural-network quality. The neural network (line 9 of Alg. 1) is trained, using 
reinforcement learning [60], to predict an acceleration given a vehicle state, that 
is, the position of the vehicle on the map, the magnitude and direction of its 
velocity, its distance to goal locations, and its distance to obstacles. As expected, 
the quality of the neural-network predictions depends on the amount of training. 
In our experiments, we use well (GOOD), moderately (MOD), and poorly (POOR) 
trained neural networks. We use the average reward at the end of the training 
process to control the quality. More details are provided in RQ2. 

Noise. We complicate the RACETRACK benchmark by adding two sources of 
non-determinism, namely environment (ENV) and neural-network (NN) noise. 
Introducing such noise is common practice in reinforcement learning, for in- 
stance, when modeling real-world imperfections, like slippery ground. 

When environment noise is enabled, the controller might zero the vehicle 
acceleration (in practice, with a small probability), instead of applying the ac- 
celeration predicted by the neural network for the current vehicle state. This 
source of non-determinism is implemented by the if-statement on line 6 of Alg. 1. 
Environment noise may be disabled for states that are too close to obstacles to 
allow the vehicle to avoid definite crashes by adjusting its course according to 
the neural-network predictions. The amount of environment noise is, therefore, 
controlled by the distance to an obstacle (OD) at which we disable it. For ex- 
ample, when OD = 3, environment noise is disabled for all vehicle states that 
are at most 3 cells away from any obstacle. Consequently, when OD = 1, we 
have a more noisy environment. Note that we do not consider OD = 0 since the 
environment would be too noisy to verify safety for any start state. 

Note that environment noise is not meant to represent realistic noise, but
rather to make the verification task more challenging. However, it is also not
entirely unrealistic and can be viewed as “necessarily rectifying steering course
close to obstacles”. Non-deterministically zeroing acceleration is inspired by
related work [32].

For a given vehicle state, the neural network computes confidence values for
each possible acceleration; these values sum up to 1. Normally, the predicted
acceleration is the one with the largest confidence value, which however might
not always be high. When neural-network noise is enabled, the network analyzer
considers any acceleration for which the inferred upper bound on the confidence
value is higher than a threshold $\epsilon$. For example, when $\epsilon$ = 0.25, any acceleration
whose inferred confidence interval includes values greater than 0.25 might be
predicted by the neural network. Consequently, for lower values of $\epsilon$, the neural
network becomes more noisy. Such probabilistic action selection is widely used
in reinforcement learning [55].
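The threshold rule can be illustrated with a short Python sketch. The function name `possible_actions` and the confidence intervals below are made up for the example; only the rule itself (keep every acceleration whose inferred upper bound exceeds the threshold) is taken from the description above.

```python
def possible_actions(confidence_bounds, epsilon):
    """Indices of all actions whose inferred confidence upper bound exceeds epsilon."""
    return [a for a, (lo, hi) in enumerate(confidence_bounds) if hi > epsilon]


# Hypothetical inferred confidence intervals (lower, upper) for four accelerations.
bounds = [(0.05, 0.20), (0.10, 0.30), (0.40, 0.70), (0.00, 0.25)]

print(possible_actions(bounds, 0.25))  # [1, 2]
print(possible_actions(bounds, 0.10))  # [0, 1, 2, 3]
```

Lowering the threshold from 0.25 to 0.10 enlarges the set of accelerations that the analysis must consider, which matches the observation that lower thresholds make the network noisier.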

Each of these two sources of noise (ENV and NN noise) renders the verification
of a neural-network controller through enumeration of all possible execution
paths intractable: due to the non-determinism, the number of execution paths
from a given initial state grows exponentially with the number of control
iterations (e.g., the main loop on line 5 of Alg. 1). In our RACETRACK experiments,
the bound on the number of loop iterations is 50, and as a result, the number
of execution paths from any given initial state quickly becomes very large. By
statically reasoning about sets of execution paths, our approach is able to more
effectively handle challenging verification tasks in comparison to exhaustive
enumeration.
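A back-of-the-envelope calculation shows how quickly enumeration breaks down. The sketch below assumes, purely for illustration, that every loop iteration offers exactly two outcomes (the predicted acceleration is applied or zeroed); the real benchmark has fewer choices near obstacles and runs that end early, so the figure is only an upper bound under that assumption.

```python
def max_paths(branching, iterations):
    """Upper bound on the number of execution paths of a bounded loop."""
    return branching ** iterations


# Two outcomes per iteration, 50 loop iterations.
print(max_paths(2, 50))  # 1125899906842624
```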
Lookahead functionality. We further complicate the benchmark by adding 
lookahead functionality (not shown in Alg. 1), which aims to counteract incorrect 
predictions of the neural network and prevent crashes. In particular, when this 
functionality is enabled, the controller simulates the vehicle trajectory when 
applying the acceleration predicted by the neural network a bounded number of 
additional times (denoted LA). For example, when LA = 3, the controller invokes 
the neural network 3 additional times to check whether the vehicle would crash 
if we were to consecutively apply the predicted accelerations. If this lookahead 
functionality indeed foresees a crash, then the controller reverses the direction of 
the acceleration that is predicted for the current vehicle state on line 9 of Alg. 1. 
Conceptually, the goal behind our lookahead functionality is similar to the one 
behind shields [2]. While lookahead is explicitly encoded in the program as code, 
shields provide a more declarative way for expressing such safeguards. 
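The lookahead logic described above can be sketched on a toy one-dimensional track. Everything in this example is hypothetical (the wall position, the `move` dynamics, and the always-accelerating `predict` stand-in for the network); it only mirrors the described control flow, in which LA additional predictions are simulated and the current acceleration is reversed if a crash is foreseen.

```python
WALL = 10  # hypothetical obstacle position on a one-dimensional track


def move(state, acc):
    """Apply an acceleration; return the new state and whether the vehicle crashed."""
    pos, vel = state
    vel += acc
    pos += vel
    return (pos, vel), pos >= WALL


def predict(state):
    """Stand-in for a poorly trained network that always accelerates forward."""
    return 1


def choose_acceleration(state, lookahead):
    """Return the predicted acceleration, reversed if the lookahead foresees a crash."""
    acc = predict(state)
    if lookahead == 0:
        return acc  # assumption: LA = 0 disables the safeguard entirely
    sim_state, crashed = move(state, acc)
    for _ in range(lookahead):
        if crashed:
            break
        sim_state, crashed = move(sim_state, predict(sim_state))
    return -acc if crashed else acc


print(choose_acceleration((0, 0), lookahead=0))  # 1
print(choose_acceleration((0, 0), lookahead=1))  # 1
print(choose_acceleration((0, 0), lookahead=3))  # -1
```

With LA = 3, the simulated trajectory reaches the wall on the fourth step, so the sketch reverses the acceleration; with LA = 1, the crash lies beyond the simulated horizon and the prediction is kept.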


4.2 Implementation 


For our implementation¹¹, we extended CRAB to support specialized abstract
domains as described in Sect. 3. To integrate DEEPSYMBOL and ERAN, we
implemented a thin wrapper around these tools to enable their analysis to start
from a set of abstract input states and return a set of abstract output states.
Moreover, our wrappers provide control over the amount of neural-network noise
(through threshold $\epsilon$).

11 Our source code can be found at https://github.com/Practical-Formal-Methods/clam-racetrack
and an installation at https://hub.docker.com/r/practicalformalmethods/

Figure 3: Verification results for RQ1, where ENV(OD = 3). The maps on the
left are BS (top) and BB (bottom), and the map on the right is R.


4.3 Setup 


We use deep Q-learning [47] to train a neural network for each RACETRACK
variant. We developed all training code in Python using the TensorFlow¹² and
Torch¹³ deep-learning libraries.

We configure CRAB to use the Boxes abstract domain [33], DEEPSYMBOL to 
use Intervals [18] with symbolic propagation [41], and ERAN to use DeepPoly [57]. 
When running the analysis, we did not specify a bound on the available time or 
memory; consequently, none of our analysis runs led to a time-out or mem-out. 
Regarding time, we report our results in the following, and regarding memory, 
our technique never exceeded 13.5GB when analyzing all start states of any map. 

We performed all experiments on a 48-core Intel® Xeon® E7-8857 v2
CPU @ 3.6GHz machine with 1.5TB of memory, running Debian 10 (buster).


4.4 Results 


We now present our experimental results for each research question. 
RQ1: How effective is our technique in verifying goal reachability and 
crash avoidance? To evaluate the effectiveness of our technique in proving these 
system properties, we run it on the following benchmark variants: BS, BB, and R 
maps, GOOD neural networks, ENV noise with OD = 1,2,3, and LA = 0 (i.e., no 
lookahead). The verification results are shown in Figs. 3, 4, and 5 (see Sect. 2 for 
the semantics of cell colors). These results are achieved when combining CRAB 
with DEEPSYMBOL, but the combination with ERAN is comparable (see RQ4). 
As shown in Fig. 3, for the vast majority of initial vehicle states, our technique 
is able to verify goal reachability and crash avoidance. This indicates that our 
integration of the controller and neural-network analyses is highly precise. As
expected, the more ENV noise we add (i.e., the smaller the OD values), the fewer
states we prove safe (see Figs. 4 and 5).

12 https://www.tensorflow.org
13 http://torch.ch/

Figure 4: Verification results for RQ1, where ENV(OD = 2).

Figure 5: Verification results for RQ1, where ENV(OD = 1).

Table 1: Performance results for RQ1.

MAP | NN   | NOISE       | LA | NN ANALYZER | TOTAL TIME | AVG TIME | NN AVG TIME
BS  | GOOD | ENV(OD = 3) | 0  | DEEPSYMBOL  | 1h20m34s   | 14m53s   | 30.2%
BB  | GOOD | ENV(OD = 3) | 0  | DEEPSYMBOL  | 3h52m38s   | 18m55s   | 16.1%
R   | GOOD | ENV(OD = 3) | 0  | DEEPSYMBOL  | 2h58m17s   | 11m33s   | 26.6%

Tab. 1 shows the performance of our technique. The first four columns of 
the table define the benchmark settings, the fifth the neural-network analyzer, 
and the last three show the total running time of our technique for all start 
states, the average time per state, and the percentage of this time spent on the 
neural-network analysis. Note that we measure the total time when running the 
verification tasks (for each start state) in parallel¹⁴; the average time per state
is independent of any parallelization. We do not show performance results for 
different OD values since environment noise does not seem to have a significant 
impact on the analysis time. 

Recall from Sect. 2 that, without our technique, it is currently only possible to 
verify properties of a heterogeneous system like RACETRACK by considering the 
controller in isolation, ignoring the call to the neural network, and havocking 
its return value. We perform this experiment for all of the above benchmark 
variants and find that CRAB alone is unable to prove goal reachability or crash
avoidance for any initial vehicle state; in other words, all states are red. This is
the case even when replacing Boxes with Polyhedra; these two domains perform
the most precise analyses in CRAB.

14 https://doi.org/10.5281/zenodo.1146014

Figure 6: Verification results for RQ2, with MOD neural networks.

Figure 7: Verification results for RQ2, with POOR neural networks.

RQ2: How does the quality of the neural network affect the verifi- 
cation results? To evaluate this research question, we run our technique on 
the following benchmark variants: BS, BB, and R maps, MOD and POOR neural 
networks, ENV noise with OD = 3, and LA = 0. The verification results are shown 
in Figs. 6 and 7; they are achieved by combining CRAB with DEEPSYMBOL. 

In deep Q-Learning (see Sect. 4.3), a neural network is trained by assigning 
positive or negative rewards to its predictions. A properly trained network learns 
to collect higher rewards over a run. Given this, we assess the quality of networks 
by considering average rewards over 100 runs from randomly selected starting 
states. If the network collects more than 70% of the maximum achievable reward, 
we consider it a GOOD agent. If it collects ca. 50% (or respectively, ca. 30%) of 
the maximum reward, we consider it a MOD (respectively, POOR) agent. 

In comparison to Fig. 3, our technique proves safety of fewer states since 
the quality of the networks is worse. Analogously, more states are verified in 
Fig. 6 than in Fig. 7. Interestingly, for BB, our technique proves crash avoidance 
(blue cells) more often when using a POOR neural network (Fig. 7) instead of a 
MOD one (Fig. 6). We suspect that this is due to the randomness of the training 
process and the training policy, which penalizes crashes more than getting stuck; 
so, a POOR neural network might initially only try to avoid crashes. 

Regarding performance, the analysis time fluctuates when using MOD and
POOR neural networks. There is no pattern even when comparing the time across
different map sizes for equally trained networks. This is to be expected as neural
networks may behave in unpredictable ways when not trained properly (e.g., the
vehicle may drive in circles), which affects the performance of the analysis.

Figure 8: Verification results for RQ3, where LA = 0, 1, 3 from left to right.

Table 2: Performance results for RQ3.

MAP | NN  | NOISE       | LA | NN ANALYZER | TOTAL TIME | AVG TIME | NN AVG TIME
BS  | MOD | ENV(OD = 3) | 0  | DEEPSYMBOL  | 2h27m53s   | 27m35s   | 45.0%
BS  | MOD | ENV(OD = 3) | 1  | DEEPSYMBOL  | 8h04m40s   | 1h12m20s | 14.9%
BS  | MOD | ENV(OD = 3) | 3  | DEEPSYMBOL  | 9h30m14s   | 1h47m05s | 11.49%

RQ3: How does a more complex benchmark affect the verification 
results? We complicate the benchmark by adding lookahead functionality, i.e., 
resulting in LA additional calls to the neural network per vehicle move (see 
Sect. 4.1 for more details). Since well trained neural networks would benefit less 
from this functionality, we use MOD networks in these experiments. In particular, 
we run our technique on the following benchmark variants: BS map, MOD neural 
networks, ENV noise with OD = 3, and LA = 0,1,3. The verification results are 
shown in Fig. 8; they are achieved by combining CRAB with DEEPSYMBOL. 

As LA increases, the benchmark becomes more robust, yet more complex. We 
observe that, for larger values of LA, our technique retains its overall precision 
despite the higher complexity; e.g., there are states that are verified with LA = 3 
or 1 but not with 0. However, there are also a few states that are verified with
LA = 1 but not with 3. In these cases, the higher complexity does have a negative 
impact on the precision of our analyses. 

Tab. 2 shows the performance of our technique for these experiments. As 
expected, the analysis time increases as the benchmark complexity increases. 
RQ4: How does the neural-network analyzer affect the verification results?
We first compare DEEPSYMBOL with ERAN on the following benchmark
variants: BS, BB, and R maps, GOOD neural networks, ENV noise with OD = 3, 
and LA = 0. The verification results achieved when combining CRAB with ERAN 
are shown in Fig. 9; compare this with Fig. 3 for DEEPSYMBOL. 

We observe the results to be comparable. With DEEPSYMBOL, we color 216 
cells green and 1 blue for BS, 455 green for BB, and 499 green and 6 blue for R. 
With ERAN, the corresponding numbers are 214 cells green and 7 blue for BS,
459 green and 4 blue for BB, and 485 green and 71 blue for R. We observe similar 
results for other benchmark variants, but we omit them here. 

Comparing the two neural-network analyzers becomes more interesting when
we enable NN noise. More specifically, we run our technique on the following
benchmark variants: BS, BB, and R maps, GOOD networks, NN noise with $\epsilon$ = 0.25,
and LA = 0. Fig. 10 shows the verification results when combining CRAB with
DEEPSYMBOL, and Fig. 11 when combining CRAB with ERAN.

Figure 9: Verification results for RQ4 with ERAN, where ENV(OD = 3).

Figure 10: Verification results for RQ4 with DEEPSYMBOL, where NN($\epsilon$ = 0.25).

Figure 11: Verification results for RQ4 with ERAN, where NN($\epsilon$ = 0.25).


As shown in the figures, the verification results are slightly better with ERAN. 
In particular, with DEEPSYMBOL, we color 170 cells green for BS, 109 green for 
BB, and 195 green for R. With ERAN, the corresponding numbers are 181 cells 
green for BS, 131 green for BB, and 203 green for R. Despite this, the perfor- 
mance of our technique can differ significantly depending on whether we use 
DEEPSYMBOL or ERAN, as shown in Tab. 3. One could, consequently, imagine a 
setup where multiple neural-network analyzers are run in parallel for each veri- 
fication task. If time is of the essence, we collect the results of the analyzer that 
terminates first. If it is more critical to prove safety, then we could combine the 
results of all analyzers once they terminate. 


Automated Safety Verification of Programs Invoking Neural Networks 219 


Table 3: Performance results for RQ4.

MAP | NN   | NOISE                  | LA | NN ANALYZER | TOTAL TIME | AVG TIME | NN AVG TIME
BS  | GOOD | ENV(OD = 3)            | 0  | DEEPSYMBOL  | 1h20m34s   | 14m53s   | 30.2%
BS  | GOOD | ENV(OD = 3)            | 0  | ERAN        | 43m21s     | 8m11s    | 38.4%
BB  | GOOD | ENV(OD = 3)            | 0  | DEEPSYMBOL  | 3h52m38s   | 18m55s   | 16.1%
BB  | GOOD | ENV(OD = 3)            | 0  | ERAN        | 3h26m17s   | 16m42s   | 57.2%
R   | GOOD | ENV(OD = 3)            | 0  | DEEPSYMBOL  | 2h58m17s   | 11m33s   | 26.6%
R   | GOOD | ENV(OD = 3)            | 0  | ERAN        | 4h38m03s   | 18m18s   | 53.8%
BS  | GOOD | NN($\epsilon$ = 0.25)  | 0  | DEEPSYMBOL  | 1h26m09s   | 15m41s   | 36.7%
BS  | GOOD | NN($\epsilon$ = 0.25)  | 0  | ERAN        | 45m37s     | 8m24s    | 45.0%
BB  | GOOD | NN($\epsilon$ = 0.25)  | 0  | DEEPSYMBOL  | 2h52m50s   | 13m24s   | 20.7%
BB  | GOOD | NN($\epsilon$ = 0.25)  | 0  | ERAN        | 2h59m48s   | 14m10s   | 64.7%
R   | GOOD | NN($\epsilon$ = 0.25)  | 0  | DEEPSYMBOL  | 2h01m18s   | 7m57s    | 26.4%
R   | GOOD | NN($\epsilon$ = 0.25)  | 0  | ERAN        | 3h21m32s   | 13m11s   | 54.3%


5 Related Work 


The program-analysis literature provides countless examples of powerful analysis 
combinations. To name a few, dynamic symbolic execution [29,10] and hybrid 
fuzzing [45,58,69] combine random testing and symbolic execution, numerous 
tools broadly combine static and dynamic analysis [6,20,21,49,30,13,14,22,61], 
and many tools combine different types of static analysis [7,1,34]. In contrast 
to neuro-aware program analysis, almost all these tools target homogeneous, 
instead of heterogeneous, systems. CONCERTO [61] is a notable exception that 
targets applications using frameworks such as Spring and Struts. It combines 
abstract and concrete interpretation, where, on a high level, concrete interpreta- 
tion is used to analyze framework code, whereas abstract interpretation is used 
for application code. Instead of building on existing analyzers, as in our work, 
CONCERTO introduces a designated technique for analyzing framework code. 

There is recent work that focuses specifically on verifying hybrid systems with 
DNN controllers [24,59]. Unlike in our work, they do not analyze programs that 
interact with the network, but models; in one case, ordinary differential equations 
describing the hybrid system [24], and in the other, a mathematical model of 
a LiDAR image processor [59]. In this context of hybrid systems with DNN 
controllers, there is also work that takes a falsification approach to the same 
problem [62,70,23]. They generate corner test cases that cause the system to 
violate a system-level specification. Moreover, existing reachability analyses for 
neural networks [25,42,67,68] consider linear or piecewise-linear systems, instead 
of programs invoking them. 

Kazak et al. [39] recently proposed Verily, a technique for verifying systems 
based on deep reinforcement learning. Such systems have been used in various 
contexts, such as adaptive video streaming, cloud resource management, and In- 
ternet congestion control. Verily builds on Marabou [88], a verification tool for 
neural networks, and aims to ensure that a system achieves desired service-level 
objectives (expressed as safety or liveness properties). Other techniques use ab- 
stract interpretation to verify robustness [27,57,41] or fairness properties [63] of 
neural networks. Furthermore, there are several existing techniques for checking
properties of neural networks using SMT solvers [37,38,36] and global optimization
techniques [54]. In contrast to our approach, they focus on verifying
properties of the network in isolation, i.e., without considering a program that 
queries it. However, we re-purpose two of the above analyzers [57,41] to infer 
invariants over the neural-network outputs. Gros et al. [32] make use of statis- 
tical model checking to obtain quality-assurance reports for a neural network 
in a noisy environment. Their approach provides probabilistic guarantees about 
checked properties, instead of definite ones like in our work, and also does not 
analyze a surrounding system. 


6 Conclusion 


Many existing software systems are already heterogeneous, and we expect the 
number of such systems to grow further. In this paper, we present a novel ap- 
proach to verifying safety properties of such systems that symbiotically combines 
existing program and neural-network analyzers. Neuro-aware program analysis 
is able to effectively prove non-trivial system properties of programs invoking 
neural networks, such as the 26 variants of RACETRACK.

Abstract. We present a scalable and precise verifier for recurrent neural
networks, called PROVER, based on two novel ideas: (i) a method to
compute a set of polyhedral abstractions for the non-convex and non- 
linear recurrent update functions by combining sampling, optimization, 
and Fermat’s theorem, and (ii) a gradient descent based algorithm for 
abstraction refinement guided by the certification problem that combines 
multiple abstractions for each neuron. Using PROVER, we present the first 
study of certifying a non-trivial use case of recurrent neural networks, 
namely speech classification. To achieve this, we additionally develop cus- 
tom abstractions for the non-linear speech preprocessing pipeline. Our 
evaluation shows that PROVER successfully verifies several challenging 
recurrent models in computer vision, speech, and motion sensor data 
classification beyond the reach of prior work. 


Keywords: Robustness verification · Polyhedral abstraction ·
Recurrent neural networks · Long short-term memory · Abstraction
refinement · Speech classifier verification


1 Introduction 


Recurrent neural networks (RNNs) are widely used to model long-term dependencies
in lengthy sequential signals [11,27,43]. Prior work has demonstrated the
susceptibility of RNNs to adversarial perturbations of their inputs [28], exposing
security vulnerabilities of state-of-the-art RNNs when used in domains such as speech
recognition [8,22], malware detection [16], and others. Thus, verifying the
robustness of recurrent architectures is critical for their safe deployment. While there
has been considerable interest in certifying the robustness of feedforward image
classifiers [4,12,13,23,32,37,39,47], less attention has been given to recurrent
architectures. As a result, current certification solutions do not scale beyond simple
models and datasets, which limits their practical applicability. Further, there has been
no work on verifying real-world use cases of RNNs. In this paper, we address both of
these challenges and present the first precise and scalable verifier for RNNs based
on abstract interpretation [10], which enables us to certify robustness of realistic
speech recognition systems.

© The Author(s) 2021

Fig. 1. Certification of recurrent architectures using PROVER: utterance “stop” with
perturbations is correctly classified. Possible perturbations are captured and propagated
through the system, then refined backward for improved precision. (Color figure online)

We illustrate the problem setting and overall flow in Fig. 1. Here, a speech recog- 
nition model based on the Long Short-Term Memory (LSTM) architecture [15] 
receives a signal encoding the utterance of “stop” by a human. As such models 
are usually employed in noisy environments, they must robustly classify variations 
(e.g., voice changes) to the utterance “stop”. However, recent work [8] has shown the 
model may be fooled into classifying the utterance as “go”. It is important to prove 
such mis-classifications are not possible, thus avoiding a potential exploitation by 
an adversary, for instance in automated traffic control settings (which can lead to 
accidents). Our goal is to design a verifier that can formally establish the
robustness of such models against noise-induced perturbations. We focus on LSTMs, as
they are the most widely used form of RNNs, but our methodology can be easily
extended to other architectures (e.g., Gated Recurrent Unit (GRU) [9]). Figure 1
shows how our proposed verifier, called PROVER (Polyhedral Robustness Verifier
of RNNs), automatically verifies the robustness of the model. Here, the labeled
rectangles represent operations in the network. The “Preprocess” box captures
domain-specific pre-processing operations (typically present when using RNNs,
e.g., speech processing). In our method, we first compute a polyhedral abstraction
capturing all speech signals given as input to the model under the given
perturbation budget. At each timestep $t$, the pre-processing operation receives a
polyhedron $s^{(t)}$ and produces an output polyhedron $x^{(t)}$. This shape is then
propagated symbolically through the LSTM and the post-processing stage, resulting in a
polyhedral output shape, denoted as $z$ (blue shape in Fig. 1).


Key Challenge: Polyhedral Abstractions for LSTMs. The main challenge 
in certifying LSTMs is the design of precise and scalable polyhedral abstract
transformers for the non-linear operations employed in LSTMs: given a polyhedral
shape capturing hidden states $h^{(t-1)}$, to produce the shape capturing the
next set of hidden states $h^{(t)}$. A recent method [21] computes this based on
gradient-based optimization but suffers from two main limitations. First, the 
optimization procedure is computationally expensive and does not scale to real- 
istic use cases. Second, the method lacks convergence and optimality guarantees. 
To address these issues, we introduce a novel technique based on a combination 
of sampling, linear programming, and Fermat’s theorem [1], which significantly 
improves the precision and scalability compared to prior work [21], while offering 
asymptotic guarantees of convergence towards the optimal solution. 


Refinement via Optimization. To certify robustness, we must verify that 
each concrete point in the output shape z corresponds to the correct label “stop”. 
However, z can contain, due to over-approximation, spurious incorrect concrete 
points (it intersects the red region representing incorrect outputs). To address 
this issue, we form a loss based on the output shape, backpropagate the gra- 
dient of this loss through the timesteps and adjust the polyhedral abstractions 
in each LSTM unit to decrease the loss. The goal is to refine the abstraction, 
guided by the certification task. We illustrate this process in Fig. 1 using the
purple backward arrow with the refined polyhedral abstraction shown in purple. 
Using the refined abstraction, the new output shape z’ (purple polygon) lies com- 
pletely inside the green region of the output space, meaning it provably contains 
only correct output vectors (corresponding to “stop”), and hence certification 
succeeds. Overall, our method significantly increases the precision of end-to-end 
RNN certification without introducing high runtime costs. 


Key Contributions. Our main contributions are: 


— A new and efficient method to certify the robustness of RNNs to adversarial 
perturbations. Our method relies on novel polyhedral abstractions for han- 
dling non-linear operations in these architectures. 

— A novel method that automatically refines the abstraction for each input 
example being certified guided by the certification task. 

— An implementation of the method in a system called PROVER and evaluation 
on several benchmarks and datasets. Our results show that PROVER is precise 
and scales to larger models than prior work. PROVER is also the first verifier 
able to certify realistic RNN-based speech classifiers. The code is available at
https://github.com/eth-sri/prover.


2 Related Work 


While the first adversarial examples for neural networks were found in computer 
vision [6,41], recent work also showed the vulnerability of RNNs [28]. Modern 
speech recognition systems, based on RNNs, were shown susceptible to small 
noise crafted by an adversary using white-box attacks [7,8], achieving a 100% 
success rate against DeepSpeech [14], a state-of-the-art speech-to-text engine. 
These were later followed by attacks based on universal perturbation [26] and 
temporal dependency [46]. Recent work [22,31] demonstrates that adversarial
examples for audio classifiers are realizable in the real world. While giving an
empirical estimate of the vulnerability of RNNs, these works do not provide any 
formal guarantees, which is the goal of our work. 

There have also been recent works on the verification of RNNs. [2] propose 
the certification of RNNs based on mixed-integer linear programming, which only 
works for ReLU-based networks and does not consider LSTMs, which use sigmoid 
and tanh activations. [45] propose an input discretization method to certify video 
models that are a combination of CNNs and RNNs. However, discretization does 
not scale to the perturbations we consider in our work. [18] propose to verify 
RNNs by automatically inferring temporal homogeneous invariants using binary 
search. However, their approach is limited to vanilla RNNs and does not apply to 
the more commonly used LSTM networks considered in this work. [19] propose 
the statistical variant of Angluin’s algorithm [3] for probabilistic verification and
counterexample generation for RNNs; however, unlike our work, they cannot provide
deterministic guarantees. The work most related to ours is POPQORN [21], which
uses expensive gradient-based optimizations for every operation in the network. 
We experimentally show that it does not scale to practical applications such as 
speech classification. 


3 Background 


We first define the threat model and then present all operations that are part of 
the verification procedure, including speech preprocessing and LSTM updates. 


3.1 Threat Model 


We use a threat model based on the $L_\infty$-norm, where an attacker can change
each element of a correctly classified input vector $s$ by an amount $\leq \epsilon \in \mathbb{R}$
[8]. Therefore, our input region can be represented as a conjunction of intervals
$[s_i - \epsilon, s_i + \epsilon]$, where $s_i$ is the $i$-th element of $s$. Signal distortion
in this setting is measured in decibels (dB), defined as:


$$\mathrm{dB}(s) = \max_i 20 \cdot \log_{10}(|s_i|); \qquad \mathrm{dB}_s(\delta) = \mathrm{dB}(\delta) - \mathrm{dB}(s)$$


The quieter the perturbation is, the smaller $\mathrm{dB}_s(\delta)$ is. We fix $\mathrm{dB}_s(\delta) =: \epsilon$
as the dB perturbation and focus on verifying that the model classifies correctly all
signals $s'$ possible under our threat model.
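The relation between a dB budget and the per-sample interval width can be made concrete with a minimal Python sketch. It assumes that $\mathrm{dB}(\delta)$ is determined by the peak amplitude of the perturbation, as in the definition above; the function names and the sample values are hypothetical and chosen only for illustration.

```python
import math

def db(signal):
    """Peak loudness dB(s) = max_i 20 * log10(|s_i|); zero samples are skipped."""
    return max(20.0 * math.log10(abs(v)) for v in signal if v != 0)

def relative_db(delta, signal):
    """Relative loudness dB_s(delta) = dB(delta) - dB(s)."""
    return db(delta) - db(signal)

def linf_radius(signal, db_budget):
    """Peak amplitude of a perturbation whose relative loudness equals db_budget."""
    peak = max(abs(v) for v in signal)
    return peak * 10.0 ** (db_budget / 20.0)

def input_region(signal, radius):
    """Interval [s_i - radius, s_i + radius] for every sample."""
    return [(v - radius, v + radius) for v in signal]

signal = [0.0, 1000.0, -2000.0, 500.0]    # hypothetical raw samples
radius = linf_radius(signal, -40.0)        # a -40 dB perturbation budget
region = input_region(signal, radius)
```

A more negative budget yields a smaller radius, matching the statement that quieter perturbations have smaller $\mathrm{dB}_s(\delta)$.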


3.2 Long Short-Term Memory (LSTM) 


LSTM architectures [15] are popular for handling sequential data as they can
utilize long-term dependencies. These dependencies are passed through time using
two state vectors for the timestep $t$: cell state $c^{(t)}$ and hidden state $h^{(t)}$. These
state vectors are updated using the following formulas:

$$f_0^{(t)} = [x^{(t)}, h^{(t-1)}]\, W_f + b_f, \qquad o_0^{(t)} = [x^{(t)}, h^{(t-1)}]\, W_o + b_o,$$
$$i_0^{(t)} = [x^{(t)}, h^{(t-1)}]\, W_i + b_i, \qquad \tilde{c}_0^{(t)} = [x^{(t)}, h^{(t-1)}]\, W_{\tilde{c}} + b_{\tilde{c}},$$
$$c^{(t)} = \sigma(f_0^{(t)}) \odot c^{(t-1)} + \sigma(i_0^{(t)}) \odot \tanh(\tilde{c}_0^{(t)}), \qquad h^{(t)} = \sigma(o_0^{(t)}) \odot \tanh(c^{(t)}),$$

where $[\cdot,\cdot]$ is the horizontal concatenation of two row vectors, $W_\cdot$ and $b_\cdot$ are the
kernel and bias of the cell, respectively, and $\sigma$ is the sigmoid function. At timestep
$t$, vectors $f_0^{(t)}$, $i_0^{(t)}$, $o_0^{(t)}$, $\tilde{c}_0^{(t)}$ represent pre-activations of the forget gate, input
gate, output gate and the candidate gate, respectively. We show an illustration
of an LSTM cell in Fig. 2. We treat $\sigma$ and tanh as forms of activation functions,
which is why we define the LSTM using pre-activations.

Fig. 2. LSTM cell: $f_0^{(t)}$, $i_0^{(t)}$, $o_0^{(t)}$, and $\tilde{c}_0^{(t)}$ represent the pre-activated gates.

Intuitively, the input gate transforms the input vector, the forget gate filters 
the information from the previous cell state, the candidate gate prepares the 
candidate cell state, and the output gate transforms the current hidden state. 
All of these gates receive as input the hidden state $h^{(t-1)}$ of the previous cell
and the input $x^{(t)}$ representing the current frame. This recurrent architecture
allows inputs with arbitrary length, enabling LSTMs to handle temporal data, 
e.g., speech processing. 
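One LSTM update in terms of pre-activations can be sketched in plain Python as follows. The sketch assumes the standard LSTM formulation; the parameter layout (a dictionary mapping gate names to kernel and bias) and the function names are hypothetical and not taken from PROVER.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(row, kernel, bias):
    """Row vector times kernel (a list of rows) plus bias."""
    return [sum(r * kernel[i][j] for i, r in enumerate(row)) + bias[j]
            for j in range(len(bias))]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update written via the pre-activations of the four gates.

    params maps a gate name ('f', 'i', 'o', 'c') to a (kernel, bias) pair;
    this layout is an assumption made for the sketch.
    """
    row = x + h_prev                       # horizontal concatenation [x, h_prev]
    pre = {g: affine(row, *params[g]) for g in ("f", "i", "o", "c")}
    # New cell state: forget part of the old state, add the gated candidate.
    c = [sigmoid(f) * cp + sigmoid(i) * math.tanh(g)
         for f, i, g, cp in zip(pre["f"], pre["i"], pre["c"], c_prev)]
    # New hidden state: output gate applied to the squashed cell state.
    h = [sigmoid(o) * math.tanh(cv) for o, cv in zip(pre["o"], c)]
    return h, c, pre

# Hypothetical cell with one input and two hidden units; all-zero parameters.
zero_kernel = [[0.0, 0.0] for _ in range(3)]
params = {g: (zero_kernel, [0.0, 0.0]) for g in ("f", "i", "o", "c")}
h, c, pre = lstm_step([2.0], [0.1, -0.3], [1.0, -1.0], params)
```

The two products that make certification hard are visible in the list comprehensions: `sigmoid(i) * math.tanh(g)` and `sigmoid(f) * cp`, i.e. a sigmoid times a tanh and a sigmoid times an identity term.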


3.3 Speech Preprocessing 


Though there have been various works that operate directly on the raw signal 
[29,36], speech signals are commonly preprocessed using the filterbank or log Mel- 
filterbank energy methods. The result is a vector of coefficients whose elements 
contain log-scaled values of filtered spectra, one for every Mel-frequency. This 
method models the non-linear human acoustic perception as power spectrum 
filters based on Mel-frequencies. The input signal is split into several (possibly 
overlapping) frames for granular analysis, and the following steps are applied: 


1. Pre-emphasizing and windowing are preprocessing stages on the raw signals. 
Speech signals tend to have larger and smoother low-frequency samples and 
smaller and fluctuating high-frequency samples. Pre-emphasizing is a process
of subtracting the adjacent sampled values multiplied by a scalar parameter
($s_i - \alpha \cdot s_{i-1}$, commonly $\alpha = 0.97$). This alleviates the unbalanced distribution
of signal strength across frequencies. Windowing multiplies each sampled value
by the window weight for its index. The
window here refers to a Hamming window, which is a bell-like curve with a
peak in the middle of the frame that drops off at the sides. It reduces the border
effects on each frame by suppressing the values near the border with smaller
values.

2. Power spectrum of Fast Fourier transform (FFT) performs the discrete 
Fourier transform (DFT) and obtains the squared norm of each element to 
obtain intensities in the frequency domain. FFT consists of matrix multipli- 
cations with complex entries. We modify it to use only real numbers by: (i) 
separating real and imaginary parts of the matrix and constructing two sepa- 
rate matrices, (ii) multiplying each matrix with the signal, (iii) squaring the 
entries, and (iv) adding the resulting matrices entry-wise. 

3. Mel-filter bank log energy: The Mel-frequency filters are triangular, each 
emphasizing the power of the selected frequency and suppressing the adja- 
cent ones. In our case, we (i) apply the Mel-filterbank to the power spectrum 
and (ii) take the log of the entries to adjust the level. 


Following [35], each step can be represented as a distinct matrix operation. It 
allows us to decompose and rearrange the steps into slightly different stages: 


1. Pre-square stage: $S \to Y = S M_1$. This stage contains pre-emphasizing,
windowing (step 1), and FFT (until step 2-(ii)). All operations are representable
as matrix multiplications, so we pre-calculate the product matrix.

2. Square stage: $Y \to \Theta = Y \odot Y$. This is step 2-(iii). Entry-wise square
operations cannot be combined with other matrix multiplications.

3. Pre-log stage: $\Theta \to \tilde{X} = \Theta M_2$. From step 2-(iv) through step 3-(i). We
combine the operations into a single matrix.

4. Log stage: $\tilde{X} \to X = \log \tilde{X}$. Applying entry-wise logarithm (step 3-(ii)).


We use the resulting $X = [x^{(1)} \ldots x^{(T)}]^{\top}$ as the input to the neural network.
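The four stages can be sketched for a single frame in plain Python. This is an illustrative reconstruction, not the preprocessing code of PROVER: the 8-sample frame, the Hamming coefficients 0.54 and 0.46, and in particular the triangular filters over neighbouring bins (a stand-in for a real Mel-spaced filterbank) are assumptions, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def build_stage_matrices(n, alpha=0.97):
    bins = n // 2 + 1
    # Pre-emphasis acting on a row vector: p_i = s_i - alpha * s_{i-1}.
    pre = [[0.0] * n for _ in range(n)]
    for i in range(n):
        pre[i][i] = 1.0
        if i > 0:
            pre[i - 1][i] = -alpha
    # Hamming window as a diagonal matrix.
    win = [[0.0] * n for _ in range(n)]
    for i in range(n):
        win[i][i] = 0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
    # Real and imaginary DFT parts side by side: n x (2 * bins).
    dft = [[math.cos(2 * math.pi * k * i / n) for k in range(bins)] +
           [-math.sin(2 * math.pi * k * i / n) for k in range(bins)]
           for i in range(n)]
    m1 = matmul(matmul(pre, win), dft)
    # Stand-in triangular filters over neighbouring bins (not true Mel spacing).
    filt = [[0.0] * (bins - 2) for _ in range(bins)]
    for f in range(bins - 2):
        filt[f][f], filt[f + 1][f], filt[f + 2][f] = 0.5, 1.0, 0.5
    # Adding squared real and imaginary parts is linear: stack the filters twice.
    m2 = filt + filt
    return m1, m2

def log_filterbank(frame, m1, m2):
    y = matmul([frame], m1)[0]                 # pre-square stage
    theta = [v * v for v in y]                 # square stage
    x_tilde = matmul([theta], m2)[0]           # pre-log stage
    return [math.log(v) for v in x_tilde]      # log stage

frame = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, -0.4, 0.9]   # one hypothetical frame
m1, m2 = build_stage_matrices(len(frame))
features = log_filterbank(frame, m1, m2)
```

Only the square and the logarithm remain non-linear after this regrouping, so an abstract domain that is exact for affine maps needs dedicated transformers for just those two entry-wise operations.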


3.4 Verification Using DeepPoly Abstract Domain 


DeepPoly [39] is a sub-polyhedral abstract domain that associates a lower and 
an upper polyhedral bound and interval bounds per neuron. It is faster than 
Polyhedra [40] and more precise than other weakly relational domains such as 
Octagons [25], Zones [24], and Zonotopes [38] when analyzing neural networks. 
Previously, it has been successfully applied for verifying feedforward networks
in [4,39]. Formally, let $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ be an ordered set of neurons such
that the neurons in layer $l$ appear before the neurons of layer $l' > l$. DeepPoly
associates with each neuron $x_j$ both interval bounds $l_j \leq x_j \leq u_j$ and polyhedral
bounds $\sum_{i<j} a_i \cdot x_i + b \leq x_j \leq \sum_{i<j} a'_i \cdot x_i + b'$, where
$l_j, u_j, a_i, a'_i, b, b' \in \mathbb{R} \cup \{\infty\}$.
DeepPoly is exact for affine transformations, which are frequently applied both in
the speech preprocessing pipeline and the LSTM unit. DeepPoly loses precision
for the non-linear operations in LSTMs. We note that computing polyhedral
bounds on their output is more challenging than for feedforward networks.

The precision of the DeepPoly approximation for the non-linear operations 
depends on the tightness of the interval bounds of the neurons that are input 
to the non-linear operations. DeepPoly provides a scalable and precise method 
called backsubstitution for optimizing a linear expression within a region defined 
by the set of DeepPoly constraints. It does so by recursively substituting the 
bounding linear expressions of target neurons with the polyhedral bounds of 
previous layers’ neurons until reaching the input neurons. It then uses the con- 
crete bounds of the input neurons for computing the result. Backsubstitution is 
used for computing the interval bounds of neurons input to the non-linear oper- 
ations as well as for bounding the difference between the neurons in the output 
layer needed to prove robustness. We refer the reader to [39] for details of the 
backsubstitution. 
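A minimal Python sketch of backsubstitution for a lower bound is given below. It is a simplified illustration rather than the DeepPoly implementation: every neuron is assumed to carry one lower and one upper linear bound over the previous layer only, and the data layout and function name are hypothetical.

```python
def backsubstitute_lower(coeffs, const, layers, input_box):
    """Lower-bound sum_j coeffs[j] * x_j + const over the abstract region.

    layers[k][j] = (lo_coeffs, lo_const, up_coeffs, up_const) bounds neuron j of
    layer k by linear expressions over layer k - 1 (the input for k = 0).
    """
    for layer in reversed(layers):
        width = len(layer[0][0])
        new_coeffs, new_const = [0.0] * width, const
        for c, (lo_w, lo_b, up_w, up_b) in zip(coeffs, layer):
            # A positive coefficient needs the neuron's lower bound,
            # a negative coefficient needs its upper bound.
            w, b = (lo_w, lo_b) if c >= 0 else (up_w, up_b)
            new_const += c * b
            for i in range(width):
                new_coeffs[i] += c * w[i]
        coeffs, const = new_coeffs, new_const
    # Concretise with the interval bounds of the input neurons.
    return const + sum(c * (lo if c >= 0 else hi)
                       for c, (lo, hi) in zip(coeffs, input_box))

# y1 = x1 + x2 and y2 = x1 - x2 (affine, so lower and upper bounds coincide).
layers = [[([1.0, 1.0], 0.0, [1.0, 1.0], 0.0),
           ([1.0, -1.0], 0.0, [1.0, -1.0], 0.0)]]
box = [(-1.0, 1.0), (-1.0, 1.0)]
relational = backsubstitute_lower([1.0, 1.0], 0.0, layers, box)     # bounds y1 + y2 = 2 * x1
interval = (backsubstitute_lower([1.0, 0.0], 0.0, layers, box)
            + backsubstitute_lower([0.0, 1.0], 0.0, layers, box))    # bounds y1 and y2 separately
```

Here `relational` is -2.0 while `interval` is -4.0: substituting down to the inputs lets the $x_2$ terms cancel before concretisation, which is why the same procedure is used for bounding the difference between output neurons. An upper bound follows by negating the expression and the result.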


4 Overview of PROVER 


This section illustrates the workings of PROVER on a small example. Our goal is 
to certify the robustness of a single LSTM cell on the input $x \in [-1.2, 1.2]$. For
this example, we assume that there are two output classes and all intermediate
LSTM gates $\{i, f, \tilde{c}, o\}$ share the same weights and biases:


$$i = f = \tilde{c} = o = W x + b, \qquad c = \sigma(i) \odot \tanh(\tilde{c}), \qquad h = \sigma(o) \odot \tanh(c).$$
The correct output here is $h_2$ and to certify robustness we need to prove that
$h_2 - h_1 > 0$ holds for all inputs $x$. In other words, $\min h_2 - h_1 > 0$.


Polyhedral Abstraction. We build our verifier based on the DeepPoly [39] 
abstraction since DeepPoly outperforms the interval analysis and other competitive
domains, as Sect. 3.4 states.


Challenges in Computing Polyhedral Bounds for LSTMs. The composed 
binary non-linear operations applied in LSTMs such as $\sigma(x)\tanh(y)$ and $\sigma(x)y$
are significantly more complex to handle than the ReLU, Sigmoid, and Tanh
activations originally handled by [39]. This is because the non-linear operations
in LSTMs mentioned above involve transcendental functions yielding non-linear
3D curves that are neither convex nor concave. The optimal polyhedral bounds
for these operations have no closed-form solution and cannot be calculated by
simple geometry or algebra. Further, obtaining such bounds is computationally
expensive [21]. For example, obtaining the lower linear plane for bounding
$\sigma(x)\tanh(y)$ is equivalent to solving a Lagrangian with 6 variables: 3 linear
coefficients, 2 interval-bounded coordinates and 1 Lagrange multiplier for the
constraint. In contrast, the optimal polyhedral bounds for ReLU, Sigmoid, and
Tanh have closed form solutions, easily visualized in 2D.


Precise Polyhedral Bounds via LP. To overcome these challenges, we propose
a generic approach based on linear programming (LP) to compute precise
polyhedral bounds. We illustrate our approach for calculating a lower polyhedral
bound of $h_2 = \sigma(o_2)\tanh(c_2)$. First, we calculate the concrete intervals for
the two target variables via backsubstitution [39], briefly described in Sect. 3.4.
In our case, the target variables are $o_2$ and $c_2$ and the backsubstitution yields
$o_2 \in [0.4, 1.6]$ and $c_2 \in [-0.79, 0.62]$. Our abstraction can represent the affine
transformations exactly. Therefore, we obtain the exact interval for $o_2 = 0.5 \cdot x + 1$
via the backsubstitution whereas the obtained interval for $c_2$ is an overapproximate
one. Then, we uniformly sample a set of points $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ from
the input domain $[0.4, 1.6] \times [-0.79, 0.62]$. We solve the following optimization
problem to calculate the lower polyhedral bound of $h_2$:


$$\min_{A_l, B_l, C_l} \sum_{i=1}^{n} \left( \sigma(x_i)\tanh(y_i) - (A_l \cdot x_i + B_l \cdot y_i + C_l) \right),$$
subject to the constraint that $A_l \cdot x_i + B_l \cdot y_i + C_l \leq \sigma(x_i)\tanh(y_i)$ for each
$i$. This is a linear program over three variables $(A_l, B_l, C_l)$ that can be solved
efficiently in polynomial time. However, the obtained bound may not be sound as
the sampled points do not fully cover the continuous input domain. To address
this, we shift the plane downwards by an offset (decreasing $C_l$) equal to the
maximum violation between $A_l \cdot x + B_l \cdot y + C_l$ and $h_2$ based on Fermat’s theorem.
After solving the linear program and the adjustment, we obtain $A_l = 0.04$, $B_l = 0.46$,
$C_l = 0.01$, which results in the following lower polyhedral bound to $h_2$:
$h_2 \geq LB_{h_2} = 0.04 \cdot o_2 + 0.46 \cdot c_2 + 0.01$. We compute the upper bound to
$h_1$: $h_1 \leq UB_{h_1}$ analogously. After computing a polyhedral abstraction of each
neuron, we calculate the lower bound of $h_2 - h_1$ via backsubstitution as follows:


$$\begin{aligned}
\min h_2 - h_1 &\geq LB_{h_2} - UB_{h_1} \\
&\geq (0.04 \cdot o_2 + 0.46 \cdot c_2 + 0.01) - (-0.09 \cdot o_1 + 0.66 \cdot c_1 + 0.14) \\
&\geq 0.04 \cdot o_2 + 0.46 \cdot (0.07 \cdot i_2 + 0.27 \cdot g_2 + 0.09) \\
&\quad + 0.09 \cdot o_1 - 0.66 \cdot (-0.04 \cdot i_1 + 0.38 \cdot g_1 + 0.25) - 0.14 \\
&\geq 0.20 \cdot (0.5 \cdot x + 1) - 0.13 \cdot x - 0.10 \geq -0.03 \cdot x - 0.08 \geq -0.11.
\end{aligned}$$


The precision of the bounds generated by our LP-based method increases 
with the number of samples yielding optimal bounds (in the sense of small gap) 
asymptotically. For our example, the computed bounds are optimal. 
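The sampled LP can be illustrated with a short, dependency-free Python sketch on the example domain $[0.4, 1.6] \times [-0.79, 0.62]$. It is not the implementation used in PROVER: the sample count, the random seed, and the brute-force vertex enumeration (standing in for an off-the-shelf LP solver) are assumptions made for the sketch, and the function names are hypothetical.

```python
import itertools
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def h(x, y):
    return sigmoid(x) * math.tanh(y)

def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule; None if (near-)singular."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(rows)
    if abs(d) < 1e-12:
        return None
    sol = []
    for col in range(3):
        m = [list(r) for r in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        sol.append(det(m) / d)
    return sol

def lower_plane(points, tol=1e-9):
    """Maximise the summed plane height subject to plane <= h at every sample.

    Minimising the summed gap is the same as maximising the summed plane height.
    An optimum of this 3-variable LP is attained at a vertex, i.e. a plane that
    touches three samples, so every triple of samples is tried.
    """
    vals = [h(x, y) for x, y in points]
    best, best_obj = None, -math.inf
    for trip in itertools.combinations(range(len(points)), 3):
        rows = [(points[i][0], points[i][1], 1.0) for i in trip]
        plane = solve3(rows, [vals[i] for i in trip])
        if plane is None:
            continue
        a, b, c = plane
        if all(a * x + b * y + c <= v + tol for (x, y), v in zip(points, vals)):
            obj = sum(a * x + b * y + c for x, y in points)
            if obj > best_obj:
                best, best_obj = plane, obj
    return best

random.seed(0)
samples = [(random.uniform(0.4, 1.6), random.uniform(-0.79, 0.62)) for _ in range(40)]
A, B, C = lower_plane(samples)
```

The resulting plane is only guaranteed to lie below the function at the sampled points; between samples it may still cross the surface, which is exactly why the offset adjustment described above is required before the bound can be used.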

While our optimal bounds significantly improve precision compared to inter- 
vals, they are not sufficient to certify robustness. Prior work for ReLU networks 
[5, 12,23] showed that the greedy approach of always selecting the optimal bounds 
minimizing the gap can yield less precise results than an adaptive strategy which 
computes bounds guided by the certification problem. Based on this observation, 
we introduce a novel approach based on splitting and gradient descent that com- 
putes polyhedral abstractions for non-linearities employed in LSTMs informed 
by the certification problem and proves that $\min h_2 - h_1 > 0$ actually holds.


Abstraction Refinement via Splitting and Gradient Descent. While our 
method based on LP offers an efficient way to compute polyhedral abstraction of
activation functions, its main limitation is that the abstraction cannot be refined
based on the certification goal. In this work, we introduce a novel method where
we first compute multiple sound bounds for the neuron using our LP method and
then automatically obtain a combination of the computed bounds that improves
the lower bound of our certification objective $h_2 - h_1$ for each input example. As
before, we use the backsubstitution to obtain the interval bounds for the input
variables. Since the output of our LP method is sensitive to the choice of the
sampled points, we split the original input region $[l_x, u_x] \times [l_y, u_y]$ to sample more
effectively from smaller sub-regions thereby reducing the chances of missing an
outlier. We found that splitting along the two diagonals of $[l_x, u_x] \times [l_y, u_y]$ into
four triangular zones, denoted as $\mathcal{T}_k$, $k \in \{1,2,3,4\}$, performs the best in our
evaluation. We use $\mathcal{T}_0$ to denote the original input region. Next, we calculate
four additional planes, for both the upper and lower bounds, by sampling each
subregion $\mathcal{T}_k$ and then applying our LP method as before. We refer to each plane
as a candidate bound:


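$$\min_{A_l, B_l, C_l \in \mathbb{R}} \sum_{i=1}^{n} \left( \sigma(x_i)\tanh(y_i) - (A_l \cdot x_i + B_l \cdot y_i + C_l) \right)$$
$$\text{subject to } \bigwedge_{i=1}^{n} A_l \cdot x_i + B_l \cdot y_i + C_l \leq \sigma(x_i)\tanh(y_i), \quad \text{where } (x_i, y_i) \sim \mathcal{T}_k$$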

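Using our LP based method, we obtain the following corresponding candidate
polyhedral abstraction for $h_2$, $LB^k_{h_2}$, for each $\mathcal{T}_k$ in our example:

$$\begin{aligned}
h_2 &\geq LB^0_{h_2} = 0.04 \cdot o_2 + 0.46 \cdot c_2 + 0.01, & h_2 &\geq LB^1_{h_2} = 0.04 \cdot o_2 + 0.46 \cdot c_2 + 0.01, \\
h_2 &\geq LB^2_{h_2} = 0.13 \cdot o_2 + 0.63 \cdot c_2 - 0.17, & h_2 &\geq LB^3_{h_2} = 0.04 \cdot o_2 + 0.46 \cdot c_2 + 0.01, \\
h_2 &\geq LB^4_{h_2} = 0.13 \cdot o_2 + 0.63 \cdot c_2 - 0.17.
\end{aligned}$$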


Note that $LB^0_{h_2}$ denotes the polyhedral abstraction calculated for the whole
region, and there might be duplicate $LB$’s when the curve in the given subregion
is concave. The final bound $LB_{h_2}$ is a linear combination of the $LB^k_{h_2}$:


$$LB_{h_2} = \sum_{k=0}^{4} \lambda_k \cdot LB^k_{h_2}, \qquad \sum_{k=0}^{4} \lambda_k = 1$$


Our optimization algorithm, explained in Sect. 5.2, learns the values of $\lambda_k$ via
gradient descent that maximizes $\min h_2 - h_1$. For our example, we obtain $\lambda =
(0.09, 0.13, 0.34, 0.09, 0.35)$ as the set of coefficients which results in a new lower
bound of $h_2 \geq 0.10 \cdot o_2 + 0.58 \cdot c_2 - 0.11$ for the neuron $h_2$. We improve the
bounds for other neurons in a similar fashion. Using the new bounds, we obtain
$h_2 - h_1 \geq 0.01 > 0$ which enables us to correctly certify the predicate of interest.
If the certification still fails, it is possible to further refine the abstraction by 
increasing the number of splits and repeating the procedure above. 
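The arithmetic of the combination can be checked with a few lines of Python. The sketch takes the candidate planes and the coefficients $\lambda$ from the example as given and assumes, as those values suggest, that the weights are non-negative and sum to one; the function name is hypothetical.

```python
candidates = [            # (A, B, C) of each candidate lower bound for h_2, k = 0..4
    (0.04, 0.46, 0.01),
    (0.04, 0.46, 0.01),
    (0.13, 0.63, -0.17),
    (0.04, 0.46, 0.01),
    (0.13, 0.63, -0.17),
]
lam = (0.09, 0.13, 0.34, 0.09, 0.35)

def combine(planes, weights):
    """Convex combination of lower-bounding planes (weights >= 0, summing to 1)."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    return tuple(sum(w * p[j] for w, p in zip(weights, planes)) for j in range(3))

A, B, C = combine(candidates, lam)
print(round(A, 2), round(B, 2), round(C, 2))   # 0.1 0.58 -0.11
```

Because every candidate lies below the function on the whole region, any convex combination of them does as well, so the weights can be tuned freely by gradient descent without a further soundness check.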

Compared to [21], which uses a single bound, our method is more flexible and 
can tune $\lambda$ parameters to find a combination of different bounds for each neuron
that yields the most precise certification result for each certification instance.
Our method is also faster as it performs expensive gradient-based optimization 
for only the output layer whereas [21] performs this step for each neuron in the 
LSTM twice. [5,12,23] also suggest a similar idea of bounding ReLU’s lower 
bound using gradient descent, but their approach is limited to unary functions 
with trivial candidates, not applicable to our setting which requires handling 
complex binary operations with non-trivial initial bounds. 


Generality of Our Method. Our method is generic and can be easily extended 
to obtain polyhedral bounds for the non-linear operations in other architectures 
such as transformers [42] and capsule networks [34]. 


5 Scalable Certification of LSTMs 


Next, we formally describe our scalable verifier for LSTM networks. As men- 
tioned in Sect. 4, we build our verifier based on the DeepPoly abstract domain 
[39] introduced in Sect. 3.4. For simplicity, we focus on computing the polyhedral 
bounds for the output of non-linear operations. Note that the computed polyhe- 
dral bounds contain only the neurons from the previous layers. This restriction 
is required for backsubstitution used for computing the interval bounds of the 
inputs, which is an approximate algorithm for solving an LP (e.g. maximize
or minimize $x_j$) within a polyhedral region defined by DeepPoly constraints.
In Sect. 5.1, we show how to obtain tight, asymptotically optimal polyhedral
bounds on key operations in the LSTM unit: $\sigma(x)\tanh(y)$ and $\sigma(x)y$. Section 5.2
describes a novel method to dynamically choose between different polyhedral 
bounds for increasing verifier precision. 


5.1 Computing Polyhedral Abstractions of LSTM Operations 


Our goal is to bound the products of sigmoid and tanh and sigmoid and identity,
using lower and upper polyhedral planes parameterized by coefficients $A_l$, $B_l$,
$C_l$ and $A_u$, $B_u$, $C_u$, respectively. Let $f(x,y) = \sigma(x)\tanh(y)$ and $g(x,y) = \sigma(x)y$.
For $h \in \{f,g\}$ we describe how to obtain the lower and upper bounds of $h$:


$$A_l \cdot x + B_l \cdot y + C_l \leq h(x,y) \leq A_u \cdot x + B_u \cdot y + C_u$$


We formulate the search for a lower bound of $h(x,y)$ as an optimization
problem that minimizes the volume between the bound and the function, subject
to the (soundness) constraint that the lower bound is below the function value:


$$\min_{A_l, B_l, C_l} \int_{(x,y) \in B} \left( h(x, y) - (A_l \cdot x + B_l \cdot y + C_l) \right) dx\, dy$$
$$\text{subject to } A_l \cdot x + B_l \cdot y + C_l \leq h(x, y), \quad \forall (x, y) \in B. \qquad (1)$$


We denote $B = [l_x, u_x] \times [l_y, u_y]$ as the boundaries of input neurons $x$ and $y$
obtained using backsubstitution. We next describe our method to solve Eq. (1).


Step 1: Approximation via LP. We solve an approximation of the intractable
optimization problem from Eq. (1), obtaining potentially unsound constraints.
Unsoundness implies that there can be points in region $B$ which violate the
bounds. We build on the approach from [4], which proposes to approximate the
objective in Eq. (1) using Monte Carlo sampling. Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
be a set of points from $B$ sampled uniformly at random. We phrase the following
optimization problem:


$$\min_{A_l, B_l, C_l} \sum_{i=1}^{n} \left( h(x_i, y_i) - (A_l \cdot x_i + B_l \cdot y_i + C_l) \right)$$
$$\text{subject to } \bigwedge_{i=1}^{n} A_l \cdot x_i + B_l \cdot y_i + C_l \leq h(x_i, y_i). \qquad (2)$$


Figure 3 shows an input region with Monte Carlo samples as red circles and
summands in the LP objective as vertical lines. As this is a low-dimensional
linear program (LP), we can solve it exactly in polynomial time using off-the-shelf
LP solvers. We compute a candidate upper bound analogously.

Fig. 3. Visualization of the $z = \sigma(x)\tanh(y)$ curve and the lower bound
computed by linear programming. Red crosses represent the sampled points
and dashed lines show the difference between the curve and the plane
(summands in the optimization). (Color figure online)


Step 2: Adjusting the Offset to Guarantee Soundness. Since we compute the
lower bound from a subset of points in $B$, there can be a point in $B$ where the value
of $h(x, y)$ is less than our computed lower bound. To ensure soundness, we compute
$\Delta_l = \min_{(x,y) \in B} h(x, y) - (A_l x + B_l y + C_l)$ and then adjust the lower bound by
updating the offset $C_l \leftarrow C_l + \Delta_l$, resulting in a sound lower bound plane. While the
method of [4] also performs offset calculation for obtaining sound bounds, they
perform certification of image classifiers against geometric perturbations using
expensive branch and bound for calculating the offset. In contrast, we exploit
the structure of non-linearities used in LSTMs obtaining a closed-form formula
for the offset yielding an exact solution. We now provide details of our offset
adjustment method for $f(x,y) = \sigma(x)\tanh(y)$ and $g(x, y) = \sigma(x)y$.


Offset Calculation for $f(x,y) = \sigma(x)\tanh(y)$: Let $A_l \cdot x + B_l \cdot y + C_l$ be the
initial lower bounding plane obtained from LP in region $B$. We define $F(x, y)$:


$$F(x,y) = \sigma(x)\tanh(y) - (A_l \cdot x + B_l \cdot y + C_l)$$

To find $\Delta_l = \min_{(x,y) \in B} F(x,y)$, we first find the extreme points by computing
partial derivatives.


$$\frac{\partial F}{\partial x} = \sigma(x)\tanh(y)(1 - \sigma(x)) - A_l \qquad (3)$$
$$\frac{\partial F}{\partial y} = \sigma(x)(1 - \tanh^2(y)) - B_l \qquad (4)$$


We consider three cases:


— Case 1: $x \in \{l_x, u_x\}$ and $y \in [l_y, u_y]$. Under this condition, we denote
$S_x := \sigma(x)$ as a constant. To ease notation, let $t = \tanh(y)$ where $t \in
[\tanh(l_y), \tanh(u_y)]$. Then $\frac{\partial F}{\partial y} = 0$ can be rewritten as:

$$1 - t^2 = B_l / S_x \qquad (5)$$


— Case 2: $y \in \{l_y, u_y\}$ and $x \in [l_x, u_x]$. Here we set $T_y := \tanh(y)$ and $s =
\sigma(x)$, $s \in [\sigma(l_x), \sigma(u_x)]$ analogously. $\frac{\partial F}{\partial x} = 0$ is rewritten to:

$$s(1 - s) = A_l / T_y \qquad (6)$$


— Case 3: otherwise. Here, we consider both $\frac{\partial F}{\partial x} = 0$ and $\frac{\partial F}{\partial y} = 0$. By
combining Eq. (3) and Eq. (4), we reduce $\tanh(y)$ and obtain:


$$s^4 + (-2 - B_l)s^3 + (1 + 2B_l)s^2 + (-B_l)s - A_l^2 = 0 \qquad (7)$$


Given that $F(x,y)$ is differentiable and the region $B$ is compact, Fermat’s
theorem (stationary points) [1] states that $F$ achieves its extremum at either the
roots of Eq. (5), Eq. (6), and Eq. (7), or at the 4 corners of $B$. We evaluate $F$
at these points to get $\Delta_l$. We adjust the offset by replacing $C_l \leftarrow C_l + \Delta_l$. The
adjusted $F$ is no less than 0 on any point in $B$, which means that the plane with
adjusted $C_l$ becomes a sound lower bound of the $\sigma(x)\tanh(y)$ curve.
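The elimination that leads to Eq. (7) can be spelled out as follows, writing $s = \sigma(x)$ and $t = \tanh(y)$ and $A_l$, $B_l$ for the slope coefficients of the lower plane. This is a reconstruction from the two stationarity conditions of Eqs. (3) and (4), offered as a sketch of the intermediate algebra rather than as the authors' own derivation.

```latex
\begin{align*}
\frac{\partial F}{\partial x} = 0 &\;\Longrightarrow\; s(1-s)\,t = A_l
  \;\Longrightarrow\; t^2 = \frac{A_l^2}{s^2(1-s)^2}, \\
\frac{\partial F}{\partial y} = 0 &\;\Longrightarrow\; s\,(1-t^2) = B_l
  \;\Longrightarrow\; t^2 = \frac{s - B_l}{s}, \\
\text{hence}\quad A_l^2 &= s\,(s - B_l)\,(1-s)^2
  = s^4 + (-2 - B_l)\,s^3 + (1 + 2B_l)\,s^2 + (-B_l)\,s .
\end{align*}
```

Since the first condition is squared along the way, a root $s$ of the quartic is only a genuine stationary point if $t = A_l / (s(1-s))$ lies in $(-1, 1)$ and the corresponding $(x, y)$ falls inside the region. Evaluating $F$ at an extra point inside the region cannot make the computed minimum unsound, so spurious roots that pass the range check are harmless.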


Offset Calculation for $g(x,y) = \sigma(x)y$: We next calculate the offset for $\sigma(x)y$.
We define the differentiable function $G(x, y) = \sigma(x)y - (A_l \cdot x + B_l \cdot y + C_l)$ over
the compact set $B$ and compute:


$$\frac{\partial G}{\partial x} = y\,\sigma(x)(1 - \sigma(x)) - A_l \qquad (8)$$
$$\frac{\partial G}{\partial y} = \sigma(x) - B_l \qquad (9)$$


We use Fermat’s theorem and consider three cases:


— Case 1: $x \in \{l_x, u_x\}$ and $y \in [l_y, u_y]$. When $\sigma(x)$ is fixed, Eq. (9) is constant,
which means $G$ is monotonic in this case.

— Case 2: $y \in \{l_y, u_y\}$ and $x \in [l_x, u_x]$. Denote $s = \sigma(x)$ where $s \in [\sigma(l_x), \sigma(u_x)]$,
then setting Eq. (8) $= 0$ becomes

$$s(1 - s) = A_l / y \qquad (10)$$


— Case 3: otherwise. If there is a local extremum in the region, the Hessian of
$G$ must be either positive-definite or negative-definite.


$$\frac{\partial^2 G}{\partial x^2} = y\,\sigma(x)(1 - \sigma(x))(1 - 2\sigma(x)), \qquad \frac{\partial^2 G}{\partial y^2} = 0, \qquad \frac{\partial^2 G}{\partial x \partial y} = \sigma(x)(1 - \sigma(x))$$


Since $\frac{\partial^2 G}{\partial y^2} = 0$, the determinant of the Hessian is $-(\sigma(x)(1 - \sigma(x)))^2 < 0$,
so the Hessian is indefinite. Hence, there is no local extremum inside the boundaries.


To summarize, in addition to the 4 corners of $B$, we only need to consider the roots
of Eq. (10) to calculate the minimum of $G$ to obtain $\Delta_l$ for $\sigma(x)y$. Figure 3 shows
the lower bound plane obtained after solving the LP and adjusting the offset. We
update the upper bound analogously.
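For the $\sigma(x)y$ case the candidate set is small enough to sketch directly in Python. The sketch follows the case analysis above (corners plus the roots of Eq. (10) on the two edges with fixed $y$); the plane coefficients, the box, and the function name are hypothetical, and floating-point root finding is used as is, without the care a production verifier would need for rounding.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def offset_sigmoid_times_y(a, b, c, lx, ux, ly, uy):
    """Minimum of G(x, y) = sigmoid(x) * y - (a*x + b*y + c) over the box."""
    def gap(x, y):
        return sigmoid(x) * y - (a * x + b * y + c)

    candidates = [(x, y) for x in (lx, ux) for y in (ly, uy)]   # four corners
    for y in (ly, uy):                       # edges with y fixed (Case 2)
        if y == 0.0:
            continue                         # dG/dx = -a is constant: no interior root
        disc = 1.0 - 4.0 * a / y             # s(1 - s) = a / y  <=>  s^2 - s + a/y = 0
        if disc < 0.0:
            continue
        for s in ((1.0 - math.sqrt(disc)) / 2.0, (1.0 + math.sqrt(disc)) / 2.0):
            if 0.0 < s < 1.0:
                x = math.log(s / (1.0 - s))  # invert the sigmoid
                if lx <= x <= ux:
                    candidates.append((x, y))
    return min(gap(x, y) for x, y in candidates)

# Hypothetical plane 0.15*x + 0.3*y + 0 over the box [-3, 3] x [1, 2].
delta = offset_sigmoid_times_y(0.15, 0.3, 0.0, -3.0, 3.0, 1.0, 2.0)
sound_c = 0.0 + delta    # shifted offset: the plane now lies below sigmoid(x) * y on the box
```

In this example the minimum is attained at a root of Eq. (10) on the edge $y = 2$ rather than at a corner, which shows why the edge roots cannot be dropped from the candidate set.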


Asymptotic Optimality. We can prove that, similarly to [4], as we increase 
the number of samples n, the solution of the LP asymptotically approaches the 
solution of the original problem from Eq. (1). Rephrasing and simplifying the 
theorem from [4]: 


Theorem 1. Let N be the number of points sampled in the algorithm. Let
$(w_l, b_l)$ be our lower constraint (linear coefficients and bias, respectively) and
let $L(w^*, b^*)$ be the true minimum of function L. For every $\delta > 0$ there exists
$N_\delta$ such that $|L(w_l, b_l) - L(w^*, b^*)| < \delta$ for every $N > N_\delta$, with high probability.
An analogous result holds for the upper constraint $(w_u, b_u)$ and function U.


We denote $L = \int_{(x,y) \in B} F(x, y)$, and $(w_l, b_l)$ corresponds to our $A_l$, $B_l$, $C_l$. Following the
theorem, our sampling method guarantees the asymptotic optimality of our bounds.
The theorem can be extended analogously for the upper bound. 


5.2 Abstraction Refinement via Optimization 


While our approach based on sampling, linear programming, and Fermat’s the- 
orem allows us to obtain (asymptotically) optimal bounds, it still has a funda- 
mental limitation that it produces a single bound. Further, this approach is, in 
a sense, greedy: when considering the entire network, it is possible that selecting 
non-optimal planes for each neuron yields more precise results at the end. Neither 
the method from [21] nor the method in Sect. 5.1 achieves this. We present the 
first approach to learn an abstraction refinement that increases the end-to-end 
precision of certification. 


Step 1: Compute a Set of Candidate Bounds. We adapt our approach from 
Sect. 5.1 to compute a set of candidate planes, instead of a single plane. We run 
the sampling procedure multiple times, each time on a different subregion of
the original region $B = [l_x, u_x] \times [l_y, u_y]$, with the constraints still enforced over
the entire region B. We define four different triangular subdomains: $T_1$ and $T_2$
are triangles resulting from splitting B along the main diagonal, while $T_3$ and $T_4$
are triangles resulting from splitting B along the other diagonal. We additionally
define $T_0 = B$. For each $k \in \{0, 1, 2, 3, 4\}$, we perform sampling and optimization
as in Eq. (2), this time sampling from $T_k$ to obtain candidate lower bounds:

$$\min_{A_l, B_l, C_l} \sum_{i=1}^{n} \big(\sigma(x_i)\tanh(y_i) - (A_l \cdot x_i + B_l \cdot y_i + C_l)\big)$$
$$\text{subject to } \bigwedge_{i=1}^{n} A_l \cdot x_i + B_l \cdot y_i + C_l \le \sigma(x_i)\tanh(y_i), \quad \text{where } (x_i, y_i) \sim T_k$$

For each neuron i, this yields 5 candidate lower bound and upper bound
planes, $LB_i^k$ and $UB_i^k$ for $k \in \{0, 1, 2, 3, 4\}$. These five candidate planes for each
of the N neurons are shown in Fig. 4.
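
To make the five sampling domains concrete, here is a minimal sketch of how points could be drawn from them. It assumes that the "main diagonal" joins the corner with the smallest coordinates to the corner with the largest coordinates; this convention, the numbering of the triangles, and the helper names `triangles` and `sample_subdomain` are assumptions made for illustration only. Uniform sampling inside a triangle uses the standard reflection trick.

```python
import random


def triangles(lx, ux, ly, uy):
    # Corners of B, counter-clockwise starting from the smallest coordinates.
    a, b, c, d = (lx, ly), (ux, ly), (ux, uy), (lx, uy)
    # Assumed numbering: 1 and 2 share the diagonal a-c, 3 and 4 share b-d.
    return {1: (a, c, b), 2: (a, c, d), 3: (b, d, a), 4: (b, d, c)}


def sample_subdomain(k, lx, ux, ly, uy, n, rng=random):
    """Draw n points uniformly from the subdomain with index k (index 0 is all of B)."""
    if k == 0:
        return [(rng.uniform(lx, ux), rng.uniform(ly, uy)) for _ in range(n)]
    p, q, r = triangles(lx, ux, ly, uy)[k]
    points = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        if u + v > 1.0:  # reflect back into the triangle
            u, v = 1.0 - u, 1.0 - v
        points.append((p[0] + u * (q[0] - p[0]) + v * (r[0] - p[0]),
                       p[1] + u * (q[1] - p[1]) + v * (r[1] - p[1])))
    return points
```

Each set of sampled points would then feed the same LP as before, with the soundness constraints still checked over the entire region B.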


Fig. 4. Learning to combine linear bounds via gradient descent. Here the five candidate 
planes multiplied by $\lambda$ are depicted either in green or red, or both. Green represents
the sampled domain, $T_k$, and red is the extension of the obtained green plane out of the
domain. With the linear combination of the planes, we compute the bound, calculate 
the loss, and backpropagate. (Color figure online) 


Step 2: Find the Optimal Combinations of the Bounds. Next, our goal 
is to learn a linear combination of the computed candidate bounds which yields 
the highest end-to-end certification precision for the given input region. To do 
this, we define the lower and upper bound of neuron i as a linear combination
of the proposed five bounds: 


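$$LB_i = \sum_{k=0}^{4} \lambda_i^{k} \cdot LB_i^k, \quad \sum_{k=0}^{4} \lambda_i^{k} = 1, \qquad UB_i = \sum_{k=0}^{4} \lambda_i^{k+5} \cdot UB_i^k, \quad \sum_{k=0}^{4} \lambda_i^{k+5} = 1$$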


Recall that we formulate robustness certification as proving that for all labels 
i different from the ground truth label t: $z_t - z_i > 0$. The lower bound on
$z_t - z_i$ is computed using backsubstitution [39], as shown in our overview example
in Sect. 4. However, this lower bound now depends on the coefficients $\lambda$, so we
define the function $f(x, \epsilon, i, \lambda)$ which computes the lower bound of the expression
$z_t - z_i$ when using $\lambda$ to combine the neuron bounds.

We describe our approach to find the best coefficients $\lambda$ in Algorithm 1.
Consider the number of possible labels m and the number of binary operations
of interest $N_{ops}$. To find $\lambda$, we solve the optimization problem for each label i:


$$z_t - z_i \ge \max_{\lambda} f(x, \epsilon, i, \lambda)$$


If the solution to the above optimization problem is positive, then we have proved
robustness with respect to class i. In the algorithm, we initialize $\Lambda$, a
pre-normalized vector of $\lambda$, for each neuron, uniformly between -1 and 1. Then,
in each epoch, we compute the normalized $\lambda$ by applying softmax to $\Lambda$ and run
certification using $\lambda$, obtaining a loss $\mathcal{L}$ equal to the value $-f(x, \epsilon, i, \lambda)$. We
perform a gradient descent update on $\Lambda$ based on the loss. If the loss is negative,
we have found $\lambda$ which proves the robustness and the algorithm terminates. The
core updating flow is shown in Fig. 4.


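Algorithm 1. Learning $\lambda$ via gradient descent
Given input x, label y, model M, perturbation $\epsilon$
Initialize the polyhedral abstractions and candidate bounds based on x, M and $\epsilon$.
for $i \leftarrow 1$ to m where $i \ne y$ do
    Initialize $\Lambda \sim [-1, 1]^{N_{ops} \times \cdot}$, epoch $\leftarrow 0$
    repeat
        $\lambda \leftarrow \mathrm{SoftMax}(\Lambda)$, $\mathcal{L} \leftarrow -f(x, \epsilon, i, \lambda)$, $\Lambda \leftarrow \Lambda - \alpha \nabla_{\Lambda} \mathcal{L}$, epoch $\leftarrow$ epoch + 1
    until epoch = max_epoch or $\mathcal{L} < 0$
    if $\mathcal{L} \ge 0$ then
        return not certified
    end if
end for
return certified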
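
The inner loop of Algorithm 1 can be illustrated on a toy problem. The sketch below is not the PROVER implementation: it handles a single operation with five candidates, uses plain gradient descent with a hand-written softmax gradient, and replaces the backsubstitution-based function f by a hypothetical callback `f_and_grad` that returns the bound and its gradient with respect to the normalized weights. The name `learn_lambda` and all default values are assumptions.

```python
import math
import random


def softmax(v):
    m = max(v)
    e = [math.exp(t - m) for t in v]
    z = sum(e)
    return [t / z for t in e]


def learn_lambda(f_and_grad, n_candidates=5, lr=10.0, max_epoch=100, seed=0):
    """Return (certified, lam) after gradient descent on the pre-normalized weights."""
    rng = random.Random(seed)
    Lambda = [rng.uniform(-1.0, 1.0) for _ in range(n_candidates)]
    for _ in range(max_epoch):
        lam = softmax(Lambda)
        value, dvalue = f_and_grad(lam)  # lower bound and its gradient w.r.t. lam
        loss = -value
        if loss < 0.0:
            return True, lam  # positive lower bound: robustness proved for this label
        # Chain rule through softmax:
        # dloss/dLambda_j = -lam_j * (dvalue_j - sum_k lam_k * dvalue_k)
        avg = sum(l * g for l, g in zip(lam, dvalue))
        Lambda = [L + lr * l * (g - avg) for L, l, g in zip(Lambda, lam, dvalue)]
    return False, softmax(Lambda)
```

For example, with a bound that is linear in the weights, `learn_lambda(lambda lam: (sum(l * c for l, c in zip(lam, cs)), cs))` shifts the weight towards the candidate with the largest entry of `cs`. In the actual setting the gradient would come from automatic differentiation through the whole certification procedure.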


6 Certification of Speech Preprocessing 


Speech preprocessing transforms the original set of perturbed speech signals, 
represented via intervals, through complex pipeline operations, into a non-linear 
and non-convex set. Propagating this set through the network is computationally 
expensive (infeasible for large models). To address this issue, we define precise 
overapproximations of key non-linear operations found in the speech prepro- 
cessing pipeline, such as Square and Log, expressed in the DeepPoly abstraction. 
These approximate bounds are computed via constant time closed form formulas 
based on concrete bounds of the inputs. We note that the first and third stages 
of the pipeline described in Sect. 3.3 involve an affine transformation, captured 
exactly using DeepPoly. Overall, when combined with our LSTM verifier, this 
method yields more precise end-to-end certification results than using intervals 
for approximating speech preprocessing. 


Square. The lower and upper polyhedral bounds of the output of the square
function $y = x^2$ where $x \in [l_x, u_x]$ are shown in Fig. 5a. We first consider the
bounds for y which minimize the area in the xy-plane. The upper bound $UB_y$ is
obtained by computing the chord joining the end points $(l_x, l_x^2)$ and $(u_x, u_x^2)$. The
lower bound is a line parallel to $UB_y$ passing through the point
$((u_x + l_x)/2, ((u_x + l_x)/2)^2)$ in the middle of the curve.

$$LB_y = (u_x + l_x) \cdot x - ((u_x + l_x)/2)^2, \qquad UB_y = (u_x + l_x) \cdot x - u_x \cdot l_x.$$
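
As a quick illustration of these two formulas, the sketch below returns both lines as (slope, intercept) pairs. The function name `square_bounds` is hypothetical, and the threshold handling discussed next is deliberately left out.

```python
def square_bounds(lx, ux):
    """Return ((a_lo, b_lo), (a_up, b_up)) such that a*x + b bounds x**2 on [lx, ux]."""
    slope = ux + lx
    mid = (ux + lx) / 2.0
    lower = (slope, -mid * mid)  # tangent at the midpoint, parallel to the chord
    upper = (slope, -ux * lx)    # chord through (lx, lx**2) and (ux, ux**2)
    return lower, upper
```

For the interval [-1, 3] this gives the lower line 2x - 1 and the upper line 2x + 3, which enclose the parabola because x^2 - 2x + 1 = (x - 1)^2 is non-negative and x^2 - 2x - 3 = (x - 3)(x + 1) is non-positive on the interval.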


While the above bounds would be sufficient in any other domain, they do not
work for the speech domain as the subsequent Log requires that the input is
strictly non-negative, as it is not defined for negative inputs. Also, we should
carefully consider the floating point errors during calculations. Hence, we introduce
the additional parameter $\delta \in \mathbb{R}$, a small threshold value to ensure the lower
bound stays non-negative. In our experiments, we set $\delta = 1 \times 10^{-3}$. Upper and
lower bounds for $y = x^2$ are computed as $UB_y = (u_x + l_x) \cdot x - u_x \cdot l_x$ and

$$LB_y = \begin{cases}
2 \cdot (l_x + \sqrt{l_x^2 - \delta}) \cdot x - (l_x + \sqrt{l_x^2 - \delta})^2 & \text{if } 3 l_x^2 + 2 l_x u_x - u_x^2 \le 4\delta,\ \sqrt{\delta} \le l_x \\
2 \cdot (u_x - \sqrt{u_x^2 - \delta}) \cdot x - (u_x - \sqrt{u_x^2 - \delta})^2 & \text{if } 3 u_x^2 + 2 u_x l_x - l_x^2 \le 4\delta,\ u_x \le -\sqrt{\delta} \\
0 & \text{if } l_x \le \sqrt{\delta},\ -\sqrt{\delta} \le u_x \\
(u_x + l_x) \cdot x - ((u_x + l_x)/2)^2 & \text{otherwise}
\end{cases}$$

Fig. 5. Two polyhedral abstractions for the speech preprocessing stage. (a) Abstraction for the square function with
threshold $\delta = 0$. (b) Abstraction for the log function.


Log. We define the polyhedral abstraction of the output $y = \log(x)$ of the log
operation where $x \in [l_x, u_x]$, as shown in Fig. 5b. Our abstractions are optimal
and minimize the area in the xy-plane. The lower bound $LB_y$ is the chord joining
the end points $(l_x, \log(l_x))$ and $(u_x, \log(u_x))$. The upper bound $UB_y$ is the
tangent to the curve at its middle point
$((u_x + l_x)/2, \log((u_x + l_x)/2))$. Our final abstractions are:

$$LB_y = \log(l_x) + \frac{x - l_x}{u_x - l_x} \log\!\left(\frac{u_x}{l_x}\right), \qquad UB_y = \frac{2x}{u_x + l_x} - 1 + \log\!\left(\frac{u_x + l_x}{2}\right).$$
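
The two log bounds can be sketched in the same (slope, intercept) form. The function name `log_bounds` is hypothetical, and the sketch assumes 0 < l_x < u_x. Written this way, the upper bound formula is the tangent to the logarithm at the midpoint of the interval, since the tangent at m is log(m) + (x - m)/m.

```python
import math


def log_bounds(lx, ux):
    """Return ((a_lo, b_lo), (a_up, b_up)) such that a*x + b bounds log(x) on [lx, ux]."""
    slope_lo = math.log(ux / lx) / (ux - lx)
    lower = (slope_lo, math.log(lx) - slope_lo * lx)  # chord between the end points
    mid = (ux + lx) / 2.0
    upper = (1.0 / mid, math.log(mid) - 1.0)          # tangent at the midpoint
    return lower, upper
```

Soundness follows from concavity of the logarithm: a chord lies below a concave curve between its end points, and any tangent lies above it.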

7 Experimental Evaluation


We implemented our approach in a verifier called PROVER, using PyTorch [30]
and Gurobi 9.0 for solving linear programs. The code is available at
https://github.com/eth-sri/prover. We evaluate PROVER on speech classifiers for the FSDD
[17] and GSC v2 [44] datasets. Then, we compare PROVER against POPQORN
[21] on the MNIST image classification task proposed in that work. We note that
POPQORN does not scale to the speech classifiers considered in our work. We
demonstrate further scalability by verifying a large motion sensor sequence classifier,
trained on the HAPT [33] dataset, that contains four LSTM layers with 256 hidden
units each.


Setup. GSC dataset experiments ran on an Nvidia GeForce RTX 2080, while 
the rest ran on a single Tesla V100. Following convention from prior work [39], 
we consider only those inputs that are classified correctly without perturbation. 
We use the same set of hyperparameters for the experiments unless specifically 
mentioned. We use 100 sampling points for constructing the linear program and
optimize the $\lambda$ parameters using Adam [20] for 100 epochs. During optimization, we
initialize the learning rate to 100 and multiply it by 0.98 after every epoch. 


7.1 Speech Classification 


We certify the robustness of two speech classifiers for the FSDD and GSC v2 
datasets. FSDD consists of recordings of digits spoken by six different speakers,
recorded at 8 kHz. GSC has 35 distinct labels of single word utterances at
16 kHz. We compare our base method based on sampling and linear programming
(Sect. 5.1), denoted as PROVER (LP), and our method using abstraction
refinement via optimization (Sect. 5.2), denoted as PROVER (OPT).


Preprocessing. A key challenge in speech classification, not encountered in the 
image domain, is the complex preprocessing stage before the LSTM network. 
The preprocessing stage in this experiment consists of FFT and Mel-filter trans- 
formations. Preprocessed input then passes through the fully connected layer 
with ReLU activation followed by the LSTM unit. 


FSDD Certification. We used the following parameters for the preprocessing:
we slice the raw wave signal with length 256 using a stride of 200 with 10
Mel-frequencies. For this experiment, we trained an LSTM network with two LSTM
layers and 32 hidden units each, preceded by a 40-unit ReLU-activated fully-connected
layer. This network achieves an accuracy of 83.6% on the FSDD task. The average
number of frames was 14.7. We verify the first 100 correctly classified inputs
for each perturbation. Our perturbation metric on speech classification tasks is
described in Sect. 3.1. Our results are shown in Fig. 6a and Fig. 6b. We vary the
decibel perturbation between -90 dB and -70 dB and evaluate the precision and
runtime of PROVER. Figure 6a shows the percentage of certified samples: our
method based on optimizing the bounds (OPT) performs best, e.g., certifying
twice as many samples compared to LP, for a significant perturbation of -70
dB. In terms of runtime, Fig. 6b shows that the OPT runtime increases with the
perturbation magnitude, meaning that the optimizer needs more iterations to
converge to the resulting bounds.

Fig. 6. Performance plots for the FSDD and GSC datasets with different perturbations.
All tests are done with the same architecture described in the text.


Interval vs. Polyhedral Abstraction for Speech Preprocessing. We stud- 
ied experimentally the importance of designing precise polyhedral abstractions 
of the speech preprocessing pipeline. If we replace the polyhedral bounds for 
the square and logarithm operations with interval constraints, the precision of 
PROVER (LP) drops from 86% to 61% on -90 dB and from 70% to 20% on -80 
dB. This shows the importance of keeping relational information while overap- 
proximating the speech preprocessing pipeline. 


GSC Certification. We used the following parameters for the preprocessing:
we downsample the raw input to 8 kHz and slice the signal with length 1024, followed
by 10 Mel-frequency filterbanks. As with the FSDD architecture, we used two
LSTM layers with 50 hidden units each, preceded by a 50-unit ReLU-activated
fully-connected layer. This network achieves an accuracy of 80% on the GSC task.
Certifying the GSC classifier is more challenging than FSDD: this dataset has
35 labels, compared to 10 in FSDD. The larger label set size requires PROVER
to compare 34 output differences, acquiring the lower bounds of $l_{GT} - l_{FL}$,
where the two terms stand for the final output scores of the ground truth and false
label, respectively. Figure 6c shows the percentage of certified samples: 75% on
-110 dB and 46% on -100 dB with PROVER (OPT), again higher precision than
PROVER (LP). Figure 6d shows a longer running time for PROVER (OPT) than
on FSDD, due to the larger label set size.


7.2 Image Classification 


Based on the setup from [21], we flatten each image into a vector of dimension 
784. This vector is partitioned into a sequence of f frames (f depends on the 
experiment). Next, the LSTM uses this frame as an input. 


Comparison with POPQORN. We compare the precision and scalability of 
PROVER against POPQORN [21]. We trained an LSTM network containing 1 
layer with 32 hidden units using standard training, achieving an accuracy of 
96.5%. The network receives a sequence of f = 7 image slices as input and 
predicts a digit corresponding to the image. 

As POPQORN is slow, we used only ten correctly classified images randomly
sampled from the test set. For each frame index i and each method, we compute
the maximum perturbation bound $\epsilon$ such that the method can certify that the
LSTM classifier is robust to perturbations up to $\epsilon$ in the $L_\infty$-norm of the i-th
slice of the image. We determine the maximum $\epsilon$ using the same binary search
procedure as in [21].
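
The exact search procedure of [21] is not spelled out here, so the following is only a generic bisection sketch. It assumes a hypothetical predicate `certify(eps)` that returns True when the verifier proves robustness for perturbation radius `eps`, and it assumes certification is monotone in `eps`; the search range and tolerance are likewise assumptions.

```python
def max_certified_eps(certify, eps_hi=1.0, tol=1e-4):
    """Largest eps in [0, eps_hi] (up to tol) for which certify(eps) holds."""
    lo, hi = 0.0, eps_hi
    if certify(hi):
        return hi
    # Invariant: the predicate fails at hi; lo is the best radius proved so far.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if certify(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each call to `certify` corresponds to one full run of the verifier, so the number of bisection steps directly multiplies the per-example running time.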


Table 1. Certification of several LSTM models using PROVER with $\epsilon = 0.01$. F, H,
and L denote the number of frames, LSTM hidden units and layers respectively.

F | H   | L | Accuracy (%) | Certified (%), OPT/LP | Running time (s), OPT
4 | 32  | 1 | 96.1         | 91/89                 | 14.5
4 | 32  | 2 | 96.7         | 92/73                 | 29.1
4 | 32  | 3 | 95.8         | 95/65                 | 43.1
4 | 64  | 1 | 97.3         | 93/92                 | 27.0
4 | 128 | 1 | 97.1         | 95/95                 | 52.4
7 | 32  | 1 | 96.5         | 63/56                 | 32.1

Figure 7 presents the results of this experiment. We observe that, for all
three methods, early frames have smaller certified perturbation bounds $\epsilon$ than
the later frames. The reason is that the approximation error on frame 1
propagates through the later frames to the classifying layer, while the error
on frame 7 only affects the last layer. Across all frames, both our LP and
OPT methods significantly outperform POPQORN, meaning that PROVER
enables a more precise abstraction than POPQORN. As for speech classifiers,
OPT is more precise than LP. We compare running times of the three methods
on perturbations in the first frame, which is the most challenging case as it requires
propagating through all timesteps. Here, PROVER (LP), PROVER (OPT), and POPQORN
take 65, 348, and 2,160 s respectively per example on average. We conclude that
both variants of PROVER are more precise than POPQORN while being 33.2x
and 6.21x faster for LP and OPT respectively.

Fig. 7. Results for the comparison between PROVER and POPQORN. Plotted points
represent the maximum $L_\infty$ norm perturbation for each frame index 1 through 7.
(Plot title: maximum perturbation $\epsilon$ per frame; x-axis: frame index; series:
PROVER (OPT), PROVER (LP), POPQORN.)
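
The two speed-up factors for the first-frame experiment can be checked directly from the average running times of 65 s, 348 s, and 2,160 s per example:

```latex
\frac{2160}{65} \approx 33.2, \qquad \frac{2160}{348} \approx 6.21
```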


Effect of Model Size. We evaluate the scalability of PROVER by certifying sev- 
eral recurrent architectures, with varying number of frames F, hidden units H 
and LSTM layers L. For each network, we certify the first 100 correctly classified 
images using the same perturbation $\epsilon = 0.01$ for all frames, with 3 repetitions.
While in the previous experiment we certified each frame separately to closely 
follow the setup from [21], it is more natural to assume the adversary is able 
to perturb the entire input. The results are shown in Table 1. We observe that 
the precision of PROVER is affected mostly by the number of frames, as the pre- 
cision loss accumulates along the frames. Naturally, the running time increases 
with the number of neurons and frames, as PROVER is optimizing the bounds 
for each $\sigma(x)\tanh(y)$ operation. However, we also observe a counter-intuitive
phenomenon that PROVER (OPT) performs better with multi-layer models than
with the single-layer model. The precision of PROVER (LP) drops with the
number of LSTM layers, unlike that of PROVER (OPT). We hypothesize
that an increased number of trainable parameters enhances the flexibility of 
the bounds for the optimization, allowing us to find more combinations of the 
bounds that certify the input. PROVER (LP) has non-flexible bounds, so the 
error propagates. 


Effect of Perturbation Budget. We certify the robustness of the MNIST
classifier for different $\epsilon$ values. We again evaluated 100 correctly classified samples
from the test set. Figure 8 shows the experimental results. The OPT version has
significantly higher precision than LP: e.g., for $\epsilon = 0.013$ in Fig. 8a, LP proves
39% while OPT certifies 89% of samples, at the cost of a higher runtime (Fig. 8b).

Fig. 8. Results on MNIST with different epsilons and F = 4, H = 32, L = 2.


7.3 Motion Sensor Data Classification 


We further demonstrate the scalability of PROVER by considering a large classi- 
fier containing 4 LSTM layers with 256 hidden units each for the human activity 
recognition dataset HAPT [33]. Each input in the dataset consists of recorded 
triaxial linear accelerations and angular velocities, sampled at 50 Hz. Here, we
restricted HAPT to six activity classes and we trimmed angular velocities to
at most 6 s after the point of prediction. Each input sequence is sliced into sliding
windows of 0.5 s, which are then passed as an input to the classifier. The
trained classifier achieves 88% test accuracy. As in the other experiments,
we run PROVER on the first 100 correctly classified inputs.

Fig. 9. Results on HAPT with different epsilons and H = 256, L = 4.

Results, shown in Fig. 9, indicate that PROVER (OPT) verifies more inputs
than PROVER (LP), for all perturbation budgets. Although the number of parameters
has increased, Fig. 9b shows smaller running times compared to Fig. 6b and
Fig. 6d. This is because of the smaller number of classes in HAPT, as the verification
needs to perform the backsubstitution for each incorrect class. This result
shows that PROVER (i) is applicable to LSTM classifiers in various domains, and
(ii) scales to large models.


8 Conclusion 


We introduced a novel approach for certifying RNNs based on a combination of 
linear programming and abstraction refinement. The key idea is to compute a 
polyhedral abstraction of the non-linear operations found in the recurrent cells 
and to dynamically adjust this abstraction according to each input example 
being certified. Our experimental results show that PROVER is more precise and 
scalable than prior work. These advances enable PROVER to certify, for the first 
time, the robustness of LSTM-based speech classifiers. 

Abstract. This paper presents Verisig 2.0, a verification tool for closed-loop
systems with neural network (NN) controllers. We focus on NNs
with tanh/sigmoid activations and develop a Taylor-model-based reach- 
ability algorithm through Taylor model preconditioning and shrink wrap- 
ping. Furthermore, we provide a parallelized implementation that allows 
Verisig 2.0 to efficiently handle larger NNs than existing tools can. We 
provide an extensive evaluation over 10 benchmarks and compare Verisig 
2.0 against three state-of-the-art verification tools. We show that Verisig 
2.0 is both more accurate and faster, achieving speed-ups of up to 21x 
and 268x against different tools, respectively. 


1 Introduction 


Following their increasing popularity, neural networks (NNs) have been recently 
introduced to various new domains, including safety-critical systems such as 
autonomous cars [4] and airborne collision avoidance systems [21]. At the same 
time, NNs have been shown to be greatly susceptible to input perturbations: 
minor input changes can cause a NN’s outputs to vary drastically, as is the case 
with adversarial examples [26]. Such issues have emphasized the need to formally 
analyze NN-based systems and assure their safety before they are deployed. 

A number of formal verification approaches have been proposed in the last
few years to analyze closed-loop systems with NN components. On the one hand,
several techniques have been developed for reachability analysis. These works
handle the NN reachability problem in a variety of ways: by converting the NN
into a hybrid system [19]; by casting the problem into a satisfiability modulo convexity
problem [25]; by approximating the NN with a Taylor model [8, 11, 16, 20];
or by propagating NN reachable sets using star sets [27, 28]. Multiple falsification
techniques have been developed as well: these approaches work by adapting
existing hybrid-system falsifiers [2, 6] to the NN case [7, 29, 33]; methods for systematic
testing through scenario specification languages have been proposed as
well [14]. Finally, a number of techniques have been developed to analyze properties
of the NN in isolation (e.g., input-output properties) [9, 10, 12, 15, 22, 30-32],
though it is challenging to use these tools in a closed-loop setting as it is unclear
what NN specification ensures closed-loop safety in general.

Fig. 1. Overview of the closed-loop system considered in this paper.

While existing reachability techniques have shown impressive performance, 
scalability remains an obstacle to applying these tools to realistic systems. In 
particular, these methods have been evaluated mostly on low-dimensional sys- 
tems, i.e., systems with several states and at most 41 measurements [18]. The 
main scalability challenge stems from the fact that reachability is undecidable 
even for linear hybrid systems [1]. Thus, all approaches overapproximate the 
true reachable sets using a computationally convenient representation such as 
polytopes [13] or Taylor models [5]. At the same time, this overapproximation, 
known as the wrapping effect, leads to quick error accumulation over time, thus 
making it challenging to verify complex specifications over a longer time horizon. 

To address these limitations, we present Verisig 2.0, a scalable tool for ver- 
ifying safety properties of closed-loop systems with NN controllers. We com- 
bine ideas from NN reachability with ideas from hybrid system verification. 
In particular, we adopt the idea of approximating NNs with Taylor models 
(TMs) [11,16,20], and we alleviate the wrapping effect through TM precondi- 
tioning and shrink wrapping [3, 23,24]. Finally, we note that the NN reachability 
computation can be parallelized since each neuron in a layer can be analyzed 
independently. We have implemented our tool in conjunction with the hybrid 
system tool Flow* [5], which enables us to handle general hybrid system models 
with NN components. 

We compare Verisig 2.0 against three tools, namely Verisig [20], NNV [28], 
and ReachNN* [11]. We use 10 benchmarks that illustrate various challenges, 
such as hybrid models, non-linear systems and systems with high-dimensional 
observations. The results indicate that Verisig 2.0 is significantly faster (achieving
speed-ups of up to 21x and 268x against Verisig and ReachNN*, respectively)
and produces tighter reachable set approximations on all benchmarks. 

In summary, this paper has three contributions: 1) a Taylor-model-based
NN reachability method using TM preconditioning and shrink wrapping;
2) an efficient implementation that allows for parallel execution; 3)
an extensive comparison against existing tools on 10 diverse benchmarks.
The source code to reproduce the results is available online
(github.com/rivapp/CAV21_repeatability_package) as well as in the main Verisig repository
(github.com/Verisig/verisig).


2 Problem Statement 


This section outlines the reachability problem addressed in this paper. We consider
a closed-loop system, illustrated in Fig. 1, consisting of: a) a plant with
states x modeled as a hybrid system; b) measurements y produced as a function
of x; c) an NN controller h that takes y as input and produces controls u.


Plant Model. We assume the plant is modeled as a standard hybrid system. In
particular, the state space $X = X_D \times X_C$ consists of continuous variables $X_C$
and discrete locations $X_D = \{q_1, \ldots, q_m\}$. When in location $q \in X_D$, the system
evolves according to differential equations $f_q$, i.e., $\dot{x} = f_q(x, u)$, where $x \in X_C$.
Each location $q \in X_D$ has an associated invariant $I(q) \subseteq X_C$ that must hold true
in that location. Transitions between locations are enabled by guards, which are
boolean predicates on the continuous variables. Finally, each continuous variable
may be reset to a new value when transitioning to a new location.


Observation Model. The system produces observations y = g(x), where 
$g : X \to \mathbb{R}^p$. Note that some benchmarks in this paper use state feedback only,
i.e., $y = x$.


Controller. The controller h is a fully-connected feedforward NN with sig- 
moid/tanh activations. Formally, h can be represented as a composition of its L 
layers: 

$$h(y) = h_L \circ h_{L-1} \circ \cdots \circ h_1(y), \qquad (1)$$


where each $h_i(y) = a(W_i y + b_i)$ performs a linear function, with parameters $W_i$
and $b_i$ identified during training, followed by a sigmoid/tanh activation a.


Composed System. Although the hybrid system formulation places no restric- 
tions on the controller/plant composition, in the interest of clarity we assume 
the controller is executed in a time-triggered fashion, with sampling period T, 
as follows: $u(t) = h(y(t_k))$, for $t \in [t_k, t_k + T)$, where $t_k = kT$ and $k = 0, 1, 2, \ldots$
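
The time-triggered composition can be illustrated with a small simulation. The sketch below is not part of Verisig 2.0 and performs no verification: it only executes a single trajectory, assumes a plant with one discrete location, uses forward Euler integration between samples, and fixes tanh as the activation. The names `nn_controller` and `simulate` and the callbacks `f` (dynamics) and `g` (observation model) are hypothetical.

```python
import math


def nn_controller(y, layers):
    """h = h_L o ... o h_1, each layer computing tanh(W y + b)."""
    for W, b in layers:
        y = [math.tanh(sum(w * v for w, v in zip(row, y)) + bi)
             for row, bi in zip(W, b)]
    return y


def simulate(f, g, layers, x0, T, steps, substeps=100):
    """Hold u(t) = h(y(t_k)) constant on [t_k, t_k + T), with t_k = k*T."""
    x, dt = list(x0), T / substeps
    trace = [list(x0)]
    for _ in range(steps):
        u = nn_controller(g(x), layers)  # sampled once at t_k
        for _ in range(substeps):        # forward Euler between two samples
            dx = f(x, u)
            x = [xi + dt * di for xi, di in zip(x, dx)]
        trace.append(list(x))
    return trace
```

A reachability tool replaces the single state `x` by a set representation (for example a Taylor model) and the Euler step by a validated flowpipe computation, but the sample-and-hold structure of the loop stays the same.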


Closed-Loop Reachability Problem. Let S be a composed system. Given an initial
set of states $x(0) \in X_0$, the reachability problem, expressed as property $\phi$, is to
verify a property $\psi$ of the reachable states of S:


$$\phi(X_0) \equiv (x(0) \in X_0) \Rightarrow \psi(x(t)), \ \forall t \ge 0. \qquad (2)$$

3 Background: Neural Networks as Taylor Models


As described in Sect. 1, in this work we adopt a TM-based approach for propagat- 
ing NN reachable sets. There are two main reasons for this: 1) TMs can approxi- 
mate any differentiable function over a bounded range given a high enough order; 
2) TMs are very effective at approximating hybrid system reachable sets, which 
allows for a smooth composition between the NN and the rest of the system. 
The rest of this section formalizes TMs and summarizes the existing approaches 
to using TMs for NN reachability. 


Taylor Model Definition. Intuitively, a TM of a function f is a polynomial
approximation p, together with a worst-case error bound I. A j-degree polynomial
approximation p of a j times continuously differentiable function f around
a point x, written $p(x) \equiv_j f(x)$, is a polynomial p of degree j such that all partial
derivatives of f and p up to order j coincide at x. Let $\mathbb{I}$ be the set of all intervals $I = [a, b]$
and let $f : D \to \mathbb{R}$ be a function of n variables defined over a domain $D \in \mathbb{I}^n$.
Then a Taylor model of f over D of degree j is a pair (p, I) of a polynomial
approximation p and an error bound I (also known as a remainder) such that:


1) $f(c) \equiv_j p(c)$, where c is the center of D,
2) $\forall x \in D,\ f(x) \in \{p(x) + e \mid e \in I\}$.


Taylor Model Arithmetic. Let $TM_1 = (p_1, I_1)$ and $TM_2 = (p_2, I_2)$ be two TMs
defined over a domain D. Addition and multiplication are defined as follows [5]:


$$TM_1 + TM_2 = (p_1 + p_2,\ I_1 + I_2)$$
$$TM_1 \times TM_2 = (p_1 \times p_2,\ \mathrm{Int}(p_1) I_2 + \mathrm{Int}(p_2) I_1 + I_1 \times I_2),$$


where Int(p) is an interval bound of p over D.
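
These two rules can be written down directly for the one-variable case. The sketch below is an illustration only and makes several simplifying assumptions that a real tool such as Flow* does not: polynomials are coefficient lists in a single variable over the fixed domain D = [-1, 1], the interval bound `int_bound` is the coarse estimate obtained from |x^k| <= 1, products are not truncated to a fixed order, and floating-point rounding is not directed outward. All function names are hypothetical.

```python
def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])


def imul(a, b):
    prods = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(prods), max(prods))


def padd(p, q):
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0.0) + (q[i] if i < len(q) else 0.0)
            for i in range(n)]


def pmul(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def peval(p, x):
    return sum(c * x ** k for k, c in enumerate(p))


def int_bound(p):
    """Coarse interval enclosure of p over D = [-1, 1]."""
    r = sum(abs(c) for c in p[1:])
    return (p[0] - r, p[0] + r)


def tm_add(tm1, tm2):
    (p1, I1), (p2, I2) = tm1, tm2
    return padd(p1, p2), iadd(I1, I2)


def tm_mul(tm1, tm2):
    (p1, I1), (p2, I2) = tm1, tm2
    rem = iadd(iadd(imul(int_bound(p1), I2), imul(int_bound(p2), I1)), imul(I1, I2))
    return pmul(p1, p2), rem
```

The multiplication remainder makes the source of the wrapping effect visible: the terms Int(p1) and Int(p2) replace each polynomial by a box, so the remainder grows with the width of those boxes.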

TMs have shown impressive performance in hybrid system reachability prob- 
lems due to their ability to approximate any continuously differentiable function 
given a high enough order [5]. Another appealing feature is that TMs can be used 
to approximate solutions of differential equations through Picard iteration [5]. 
Thus, it is natural to try to use TMs to approximate NN reachable sets as well. 

Two classes of approaches for approximating NNs with TMs have been developed
in the literature. The first one is sampling-based: given a TM $TM_y$ of the
inputs y to h, these methods sample points Z from $TM_y$ and corresponding
outputs h(Z) to perform polynomial regression [8] or approximation [16]. While
these approaches work well for systems with several state variables, they cannot 
handle higher-dimensional problems due to insufficient sampling. 

A second approach to using TMs for NN reachability is to propagate the 
TMs through each neuron in the NN. Specifically, let $TM_y = (p, I)$ be the TM
for y and consider a neuron v that computes the function $\sigma(wy + b)$, where $\sigma$
denotes the sigmoid. One can use TM arithmetic [5] to obtain $TM_L = (wp + b, wI)$
for the linear map in v. For the sigmoid TM, $TM_\sigma$, one could obtain
a Taylor series expansion of $\sigma$ around the center of $TM_L$ and get remainder
bounds using Taylor's theorem [20]. Thus, the final TM for v is $TM_v = TM_\sigma \circ TM_L$.
The benefit of propagating TMs in this fashion is that no sampling is
necessary since the NN is approximated directly. On the other hand, scalability 
challenges manifest in a different way, namely the TM remainders may grow 
quickly depending on the NN architecture (explained in more detail in Sect. 4). 

We adopt the latter approach to approximating NNs as TMs due to its
improved scalability. The next section describes our approach to reducing the 
TM remainder size through TM preconditioning and shrink wrapping. 


Fig. 2. The wrapping effect for different Taylor model orientations (panels: large wrapping effect, smaller wrapping effect).

Fig. 3. Illustration of the shrink wrapping method.


4 Taylor Model Preconditioning and Shrink Wrapping 


This section presents our approach to limiting the remainder growth as TMs are 
propagated through the NN. We explore two complementary techniques, namely 
TM preconditioning and shrink wrapping. Both of these ideas were originally 
developed for the purpose of reachability analysis of hybrid systems [3, 23]; in
this paper, we adapt them to the NN case.


4.1 Taylor Model Preconditioning 


As noted in Sect. 3, although propagating the TM through the NN is preferred 
since it captures the functional representation of each neuron, it may suffer from 
quick remainder growth. The following example illustrates this process. 


Example 1. Let $y_1$ and $y_2$ be inputs to the NN h with corresponding TMs
$TM_1 = (p_1, I_1)$ and $TM_2 = (p_2, I_2)$ over domain $D \in \mathbb{I}^n$. Let v be a neuron
in the first layer implementing the function $v(y_1, y_2) = \sigma(w_1 y_1 + w_2 y_2 + b)$.
The TM for the linear part of v is

$$TM_L := (p_L, I_L) = (w_1 p_1 + w_2 p_2 + b,\ w_1 I_1 + w_2 I_2).$$

Let $TM_\sigma = \sigma(a) + \sigma'(a)(TM_L - a) + \sigma''(a)(TM_L - a)^2/2 + I_\sigma$ be a second-order
Taylor series expansion of the sigmoid around point a, with remainder $I_\sigma$. Using
TM arithmetic [5], the TM for v is $TM_v = (p_v, I_v)$, where

$$p_v = 0.5\,\sigma''(a)\,p_L^2 + (\sigma'(a) - a\,\sigma''(a))\,p_L - (\sigma'(a) - 0.5\,a\,\sigma''(a))\,a + \sigma(a)$$
$$I_v = 0.5\,\sigma''(a)\,(2\,\mathrm{Int}(p_L)\,I_L + I_L^2) + (\sigma'(a) - a\,\sigma''(a))\,I_L + I_\sigma.$$
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Remark 1. In order to compute a TM, = (po, Io) for the sigmoid/tanh, one 
can follow the procedure described in prior work [20]. In summary, the following 
three steps need to be performed, assuming the input TM is denoted by TMz: 


1. compute interval bounds, [a,b], for TMz using interval analysis; 

2. obtain a Taylor series approximation, p,, of the sigmoid/tanh around the 
midpoint of [a,b]; 

3. obtain worst-case error bounds, Is, for po using Taylor’s theorem. 
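
The three steps of Remark 1 can be sketched for a single scalar input as follows. This is only an illustration under stated assumptions, not the procedure of [20]: the input bounds $[a,b]$ (step 1) are taken as given, the expansion is second order, and the remainder uses the global bound $|\sigma'''| \le 1/8$, which is valid but conservative. The names `sigmoid_taylor_model` and `eval_poly` are hypothetical, and floating-point round-off is ignored.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_taylor_model(a, b):
    """Second-order Taylor polynomial of the sigmoid on [a, b] plus a remainder radius."""
    m = 0.5 * (a + b)                  # step 2: expand around the midpoint of [a, b]
    s = sigmoid(m)
    d1 = s * (1.0 - s)                 # sigma'(m)
    d2 = d1 * (1.0 - 2.0 * s)          # sigma''(m)
    coeffs = (s, d1, 0.5 * d2)         # polynomial coefficients in powers of (x - m)
    r = 0.5 * (b - a)                  # largest distance from the midpoint
    # step 3: Lagrange remainder |sigma'''(xi)| * r^3 / 3!, using |sigma'''| <= 1/8
    # (assumption: a global bound, looser than a bound computed on [a, b] itself)
    rem = (1.0 / 8.0) * r ** 3 / 6.0
    return m, coeffs, rem


def eval_poly(m, coeffs, x):
    dx = x - m
    return coeffs[0] + coeffs[1] * dx + coeffs[2] * dx * dx
```

For every $x \in [a,b]$, the sigmoid value then lies within `rem` of `eval_poly(m, coeffs, x)`, so the pair (polynomial, $[-\mathrm{rem}, \mathrm{rem}]$) plays the role of $(p_\sigma, I_\sigma)$.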


As shown in Example 1, the remainder is propagated using interval analysis, 
where a major contributor is the $\mathrm{Int}(p_L)$ term, i.e., the interval bounds of $p_L$. 
Since this term approximates the (potentially high-dimensional) input TM with 
a box, it may introduce a significant wrapping effect if the input TM is not a 
box, as illustrated in Fig. 2. The natural way to address this wrapping effect is 
to rotate the TM in order to align it with the axes [23,24]. 


Algorithm 1. NN Verification Using Taylor Model Preconditioning 
Input: Measurement TM vector $TMV_y$, NN $h$ with $L$ layers and sigmoid activations. 
1: $TMV_0 \leftarrow TMV_y$ 
2: for each $i$ in $\{1, \dots, L\}$ do 
3:   $TMV_i^{L} \leftarrow W_i * TMV_{i-1} + b_i$ 
4:   $(Q + c, 0) \circ (R + Q^\top N, Q^\top I) \leftarrow$ TaylorModelPreconditioning($TMV_i^{L}$) 
5:   $TMV_i' \leftarrow$ TaylorModelForSigmoid($(Q, 0)$)  // Taylor series approximation 
6:   $TMV_i \leftarrow TMV_i' \circ (R + Q^\top (c + N), Q^\top I)$ 
7: end for 
8: return $TMV_L$ 


Since the set represented by a TM is the image of a polynomial over a given 
domain, it is challenging to choose an appropriate rotation matrix. However, 
as discussed in prior work [23,24], if one first normalizes the TM so that the 
domain is $[-1,1]^n$, then the linear terms become the largest contributors to 
interval analysis overapproximation (since higher order terms are less than 1 in 
magnitude). Thus, a good choice for a rotation matrix is the matrix formed by 
the linear terms of the (normalized) TM. 

To formalize the above concept, let us decompose a TM vector $TMV = (p, I)$ 
into $TMV = (c + M + N, I)$, where $c$ denotes the constant terms, $M$ denotes 
the linear terms and $N$ denotes the higher-order terms. The idea of preconditioning 
is to decompose $M = QR$, where $Q$ is an orthonormal matrix and $R$ is 
upper-triangular. This is achieved by splitting $TMV$ into a composition of two 
TM vectors: $TMV = (Q + c, 0) \circ (R + Q^\top N, Q^\top I)$.¹ Then, each neuron's computation 
is performed on $Q$ only, which alleviates the wrapping effect introduced 
by $\mathrm{Int}(p_L)$ in Example 1 since $Q$ is orthonormal. 


1 Note that the new remainder may need to be enlarged to also include numerical 
errors due to the computation of Q. 

The procedure is presented in Algorithm 1. Note that preconditioning is 
performed in each layer, after which the two parts are composed again into the full 
TM. While it is possible to represent the final TM as a composition of individual 
layer TMs, the benefits of preconditioning would decrease after the first layer, 
since most of the variability is captured in the right-most TM. 
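
To make the effect of the $M = QR$ split concrete, the following sketch compares the box enclosure of a thin, tilted linear map before and after rotating into the coordinates given by $Q$. It is a toy illustration only: it covers the linear terms of a two-dimensional normalized TM, ignores the constant and higher-order terms and all round-off, and the helper names `precondition` and `box_area` are hypothetical.

```python
import numpy as np


def precondition(M):
    """Split the linear part M into an orthonormal Q and an upper-triangular R."""
    Q, R = np.linalg.qr(M)
    return Q, R


def box_area(A):
    """Area of the axis-aligned box enclosing {A x : x in [-1, 1]^2}."""
    radius = np.abs(A).sum(axis=1)      # interval radius of each output coordinate
    return float(np.prod(2.0 * radius))


# A thin parallelogram that is far from axis-aligned (illustrative values).
M = np.array([[1.0, 0.9],
              [0.9, 1.0]])
Q, R = precondition(M)

area_original = box_area(M)   # box in the original coordinates
area_rotated = box_area(R)    # box in the coordinates rotated by Q^T, since Q^T M = R
```

For this choice of `M`, the box in the rotated coordinates is much smaller than the box in the original coordinates, which is the situation depicted in Fig. 2.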


4.2 Shrink Wrapping 


In systems where verification over a longer time horizon is required, avoiding 
large remainders may be impossible even with effective preconditioning. In such 
cases, one could use shrink wrapping in order to refactor the TM into one that 
results in slower remainder accumulation in the future [3,24]. 

The high-level idea of shrink wrapping is illustrated in Fig. 3. If the remainder 
becomes a significant part of the set described by the TM, then TM arithmetic 
degrades into standard interval analysis. In this case, it helps to transform the 
TM into a new TM that contains the original one but has no remainder. Thus, 
even if the new TM is slightly larger, it is propagated symbolically using TM 
arithmetic, which results in smaller error accumulation in the long run. 

The choice of new TM is not obvious and is affected by the system in 
consideration. The standard approach in related work [3,24] is to focus on 
the linear terms (assuming the TM is normalized so that $D = [-1,1]^n$). 
Specifically, suppose that the system's state $x$ is described by the TM vector 
$TMV_x = (p, I) = (c + M + N, I)$. One option for the new TM vector is to 
premultiply $TMV_x$ by $M^{-1}$, thereby reducing the linear terms to the identity 
matrix, $\mathcal{I}$. Then a shrink wrap factor $q$ is chosen such that the image of 
the higher-order terms contains the remainder of the initial TM vector, i.e., 
$TMV_x^{sw} = (c + \mathcal{I} + qM^{-1}N, 0)$.² 

While it is possible to choose $q$ by finding bounds on the partial derivatives of 
the higher-order terms $M^{-1}N$ [3], our initial experiments indicated that a more 
straightforward approach leads to no loss in precision. In particular, we represent 
the new TM vector as $TMV_x^{sw} = (c + \mathrm{diag}(q), 0)$, where $q = \mathrm{Int}(TMV_x)$. 

The last consideration is when to perform the TM conversion: if it is applied 
too often, more error could be introduced by the frequent elimination of useful 
information in the TMs. In our experiments, shrink wrapping is triggered when 
the remainder is larger than 1e-6 and larger than 1% of the total TM range. 
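
A minimal sketch of the trigger and of the simple box-based conversion is given below. Several points are assumptions made for illustration: the remainder and the TM range are both measured by interval width, the trigger fires if any dimension exceeds both thresholds, and the new TM is centred at the midpoint of the interval bounds with the half-widths on the diagonal. The function names are hypothetical and do not correspond to the Verisig 2.0 code.

```python
import numpy as np


def should_shrink_wrap(rem_lo, rem_hi, tm_lo, tm_hi, abs_tol=1e-6, rel_tol=0.01):
    """Trigger when a remainder is wider than abs_tol and than rel_tol of the TM range."""
    rem_width = rem_hi - rem_lo
    total_width = tm_hi - tm_lo
    return bool(np.any((rem_width > abs_tol) & (rem_width > rel_tol * total_width)))


def box_shrink_wrap(tm_lo, tm_hi):
    """Replace the TM by a remainder-free box: centre c and diagonal linear part diag(q)."""
    center = 0.5 * (tm_lo + tm_hi)
    radius = 0.5 * (tm_hi - tm_lo)
    return center, np.diag(radius)
```

The returned pair describes the set $\{c + \mathrm{diag}(q)\,z : z \in [-1,1]^n\}$, which contains the interval bounds of the original TM and has zero remainder.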


5 Implementation 


We implemented our approach in conjunction with the Flow* tool [5], for easy 
integration with standard hybrid system models. We provide TM functions 
similar to the ones existing in Flow*, adapted to the case of NNs. In addition to 
modified data structures, a main difference in our implementation is the option 
to parallelize the TM vector propagation, i.e., Line 5 in Algorithm 1. This 
parallelization is possible since each neuron in a layer only depends on the input 
TMs, thus each computation can be done on a separate core. As illustrated in 
Sect. 7, parallelization reduces verification time, especially in the case of larger 
NNs, where multiple neuron computations can be performed in parallel. 


2 The new remainder may be greater than 0 due to round-off error during the inversion. 


6 Benchmarks 


We use 10 benchmarks to evaluate the proposed approach. These benchmarks 
were compiled from the related literature [17,19, 20, 28] and were selected in order 
to cover a wide variety of systems and controllers: 1) continuous and hybrid 
systems; 2) systems with state feedback and systems with measurements as a 
function of the states; 3) low-dimensional systems as well as systems with high- 
dimensional measurements; 4) controllers with both tanh and sigmoid activations 
and with a number of neurons varying from 16 to 200 per layer. 

Table 1 presents the dynamics and the initial set for each benchmark. For 
simplicity, all properties are reachability properties (i.e., the problem is to verify 
whether a goal set is reached from all initial states), though safety properties 
can be handled as well. In particular, the goal regions are as follows: 


- $B_1$: $x_1 \in [0, 0.2]$, $x_2 \in [0.05, 0.3]$; $B_2$: $x_1 \in [-0.3, 0.1]$, $x_2 \in [-0.35, 0.5]$; 


Table 1. List of benchmarks. Benchmarks $B_1$ to $B_5$ and Tora were introduced by Huang 
et al. [17]; adaptive cruise control (ACC) was presented by Tran et al. [28]; mountain car 
(MC), quadrotor with model-predictive control (QMPC) and F1/10 were introduced 
by Ivanov et al. [20]. We use V to denote the measurement dimension. In F1/10, $y$ is 
a 21-dimensional LiDAR scan. 


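Name | Dynamics | V | Initial set 
$B_1$ | $\dot{x}_1 = x_2$, $\dot{x}_2 = u x_2^2 - x_1$ | 2 | $x_1 \in [0.8, 0.9]$, $x_2 \in [0.5, 0.6]$ 
$B_2$ | $\dot{x}_1 = x_2 - x_1^3$, $\dot{x}_2 = u$ | 2 | $x_1 \in [0.7, 0.9]$, $x_2 \in [0.7, 0.9]$ 
$B_3$ | $\dot{x}_1 = -x_1(0.1 + (x_1 + x_2)^2)$, $\dot{x}_2 = (u + x_1)(0.1 + (x_1 + x_2)^2)$ | 2 | $x_1 \in [0.8, 0.9]$, $x_2 \in [0.4, 0.5]$ 
$B_4$ | $\dot{x}_1 = -x_1 + x_2 - x_3$, $\dot{x}_2 = -x_1(x_3 + 1) - x_2$, $\dot{x}_3 = -x_1 + u$ | 3 | $x_1, x_3 \in [0.25, 0.27]$, $x_2 \in [0.08, 0.1]$ 
$B_5$ | $\dot{x}_1 = x_1^3 - x_2$, $\dot{x}_2 = x_3$, $\dot{x}_3 = u$ | 3 | $x_1 \in [0.38, 0.4]$, $x_2 \in [0.45, 0.47]$, $x_3 \in [0.25, 0.27]$ 
Tora | $\dot{x}_1 = x_2$, $\dot{x}_2 = -x_1 + 0.1\sin(x_3)$, $\dot{x}_3 = x_4$, $\dot{x}_4 = u$ | 4 | $x_1 \in [-0.77, -0.75]$, $x_2 \in [-0.45, -0.43]$, $x_3 \in [0.51, 0.54]$, $x_4 \in [-0.3, -0.28]$ 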
ACC | & = 22, T2 = T3, T3 4-203 a 5 | x1 € [90,91], x2 € [32, 32.05] 
4 = @5,U5 = 16,26 = 2u — 2g ES x4 € [10,11], x5 € [30, 30.05] 
MC a} = £1 +209, 2 |æ € [—0.53, —0.5] 
z = x2 + 0.0015u — 0.0025cos(3z1) 
QMPC | x1 = z4 — 0.25, t2 = £5 + 0.25, 13 = £6 6 | x € [0.025, 0.05], 
v4 = 9.8lu1, 15 = —9.8lu2,76 = u3 — 9.81 x2 € [0, 0.025] 
F1/10 | #1 = x3cos(x4), t2 = xgsin(x4) 21 | zı E€ [—0.0025, 0.0025}, 
x3 1.63323 + 0.3266(u — 4), £4 zgton(u) x3 € [—0.0025, 0.0025] 
- $B_3$: $x_1 \in [0.2, 0.3]$, $x_2 \in [-0.3, -0.05]$; $B_4$: $x_1 \in [-0.05, 0.05]$, $x_2 \in [-0.05, 0]$; 
- $B_5$ (sig): $x_1 \in [-0.4, -0.28]$, $x_2 \in [0.05, 0.22]$; 

- $B_5$ (tanh): $x_1 \in [-0.43, -0.38]$, $x_2 \in [0.16, 0.18]$; 

- Tora: $x_1 \in [-0.1, 0.2]$, $x_2 \in [-0.9, -0.6]$; 

- ACC: 21 € [22.81, 22.87], x4 € [29.88, 30.02]; 

- MC: $x_1 \geq 0.45$; QMPC: $x_1, x_2, x_3 \in [-0.32, 0.32]$; F1/10: no crash [18]. 


7 Experiments 


We compare our tool, named Verisig 2.0, against three state-of-the-art tools, 
namely Verisig [19,20], ReachNN* [11,17], and NNV [27,28]. We selected these 
tools because they handle NNs with sigmoid/tanh activations. For each benchmark, 
we record whether each tool could verify the property (or return Unknown 
due to large approximation error). In addition, we compare the verification times 
between the different tools. While Verisig and NNV do not support parallel 
execution,³ ReachNN* has been optimized for GPU execution, so a comparison in 
terms of verification times is fair (all experiments were run on an Intel Xeon 
Gold 6248 running at 2.5 GHz and with an Nvidia GeForce RTX 2080 Ti GPU). 
Finally, we provide a comparison in terms of reachable sets. 

Verification outcomes and times are reported in Table 2. Multiple controllers 
were used in some benchmarks in order to test a variety of NNs. We present the 
results for Verisig 2.0 as used with one and with 40 cores, in order to illustrate 
the benefit of parallelization. Note that parallelization helps the most in systems 
with wider NNs, e.g., the MC benchmark, since a larger part of the computation 
is devoted to NN computation (relative to plant computation) in these systems. 


Table 2. Verification evaluation. The notation tanh/sig ($n \times k$) indicates a NN with 
tanh/sig activations, $n$ hidden layers and $k$ neurons per layer. For each tool, we provide 
the verification time in seconds; if a property could not be verified, it is marked as 
Unknown. If a tool crashed on a benchmark, it is marked as DNF. 


Name | NN setup | Verisig 2.0 (40 cores) | Verisig 2.0 (1 core) | Verisig | ReachNN* | NNV 
$B_1$ | tanh (2 × 20) | 38s | 48s | DNF | Unknown | Unknown 
 | sig (2 × 20) | 40s | 49s | Unknown | 69s | Unknown 
$B_2$ | tanh (2 × 20) | Unknown | Unknown | Unknown | Unknown | Unknown 
 | sig (2 × 20) | 6s | 8s | 12s | 32s | Unknown 
$B_3$ | tanh (2 × 20) | 32s | 43s | 98s | 128s | Unknown 
 | sig (2 × 20) | 36s | 47s | 98s | 130s | Unknown 
$B_4$ | tanh (2 × 20) | 9s | 11s | 23s | 20s | DNF 
 | sig (2 × 20) | 10s | 12s | 24s | 20s | DNF 
$B_5$ | tanh (3 × 100) | 48s | 168s | Unknown | Unknown | Unknown 
 | sig (3 × 100) | 51s | 196s | 1063s | 31s | Unknown 
Tora | tanh (3 × 20) | 43s | 70s | 134s | 2524s | Unknown 
 | sig (3 × 20) | 50s | 83s | 136s | 3402s | Unknown 
ACC | tanh (3 × 20) | 529s | 1512s | Unknown | DNF | Unknown 
MC | sig (2 × 16) | 48s | 52s | 33s | N/A | N/A 
 | sig (2 × 200) | 1241s | 4311s | Unknown | N/A | N/A 
QMPC | tanh (2 × 20) | 636s | 697s | 703s | N/A | N/A 
F1/10 | tanh (2 × 64) | 3411s | 3654s | 2021s | N/A | N/A 


3 NNV is parallelized in the case of ReLU activations, but not for smooth activations. 


Comparison with Verisig. Verisig is the closest method to Verisig 2.0, as it also 
propagates TMs through the NN. Thus, Verisig can be seen as a baseline for our 
approach, so this comparison illustrates most clearly the benefits of preconditioning 
and shrink wrapping. Firstly, note that Verisig takes significantly more 
time to compute reachable sets (21 times slower in the case of the $B_5$ benchmark) 
on all but one benchmark; the MC benchmark is peculiar because the NN is 
very small, hence most of the computation is spent on the plant. Furthermore, 
Verisig is unable to verify some properties due to increasing error. As shown 
in Fig. 4, the reachable sets computed by Verisig introduce more approximation 
error, especially in the challenging ACC benchmark, where preconditioning is 
particularly useful due to the larger input space. 


(a) ACC benchmark. (b) $B_5$ benchmark, sigmoid. (c) MC benchmark, 2 × 200. 


Fig. 4. Comparison between the reachable sets produced by Verisig (blue) and Verisig 
2.0 (green). Example simulated trajectories are plotted in red. The goal set is shown in 
magenta. Note that the goal is not reached in the $B_5$ benchmark. (Color figure online) 
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(a) Bı benchmark, sigmoid. (b) Bs benchmark, tanh. (c) Tora benchmark, sigmoid. 


Fig. 5. Comparison between the reachable sets produced by ReachNN* (blue) and 
Verisig 2.0 (green). Simulated trajectories are plotted in red (not shown in the Tora 
benchmark to improve visibility). The goal set is shown in magenta. (Color figure 
online) 
(a) ACC benchmark. (b) $B_2$ benchmark, sigmoid. (c) Tora benchmark, tanh. 


Fig. 6. Comparison between the reachable sets produced by NNV (blue) and the Verisig 
2.0 approach (green) on three of the benchmarks from Table 2. Example simulated 
trajectories are plotted in red. The goal set is shown in magenta. (Color figure online) 
Fig. 7. Verisig 2.0 remainder growth (for position, $x_1$) on the MC benchmark as we 
increase the NN size. The remainder is reset to 0 after shrink wrapping. 


Comparison with ReachNN*. ReachNN* is a sampling-based approach to NN 
verification, so it is expected to work well on low-dimensional systems and 
encounter difficulties as the dimension increases. As can be seen in Table 2, 
Verisig 2.0 is faster on all but one benchmark, and the difference is especially 
pronounced on the four-dimensional Tora benchmark, where ReachNN* is 268 
times slower. Note that ReachNN* cannot handle hybrid models, so no comparison 
could be made on those benchmarks. Finally, as shown in Fig. 5, Verisig 
2.0 also results in tighter reachable sets: the benefit of shrink wrapping can be 
observed in Fig. 5a, where the ReachNN* reachable sets eventually start to grow 
fast whereas Verisig 2.0 is able to maintain low approximation error over time. 


Comparison with NNV. Note that NNV is unable to verify any of the properties 
considered in this paper due to high approximation error. This is mostly due to 
the fact that NNV is optimized for networks with ReLU activations, where the 
star set method used in NNV is effective and parallelizable. Figure 6 shows the 
reachable sets computed by each tool, where it is clear that Verisig 2.0 maintains 
tight reachable sets whereas the NNV approximation error grows quickly. 
Scalability Evaluation. Finally, we also evaluate the scalability of Verisig 2.0 as 
we increase the NN size on the MC benchmark. Figure 7 illustrates how the 
remainder grows over time for the $x_1$ (position) state. We observe that the 
larger NN results in significantly larger remainder growth. At the same time, 
interpreting the remainder growth in isolation may be misleading since it also 
depends on the size and shape of the true reachable sets. We leave a rigorous 
analysis of the effect of NN size on scalability for future work. 


8 Conclusion 


This paper presented Verisig 2.0, a parallelized tool for NN verification. We developed 
a Taylor-model-based approach in which we reduce the approximation error 
in reachable sets through Taylor model preconditioning and shrink wrapping. 
Finally, we provided an extensive evaluation over 10 benchmarks and showed 
that our method is significantly more accurate and faster than state-of-the-art 
tools, resulting in 21x and 268x speed-ups over Verisig and ReachNN*, respectively, 
on some benchmarks. For future work, we will investigate which NN architectures 
are more amenable to verification, both in terms of size and number of layers as 
well as in terms of weight magnitude and direction. 
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Abstract. This paper introduces robustness verification for semantic 
segmentation neural networks (in short, semantic segmentation networks 
[SSNs]), building on and extending recent approaches for robustness ver- 
ification of image classification neural networks. Despite recent progress 
in developing verification methods for specifications such as local adver- 
sarial robustness in deep neural networks (DNNs) in terms of scalability, 
precision, and applicability to different network architectures, layers, and 
activation functions, robustness verification of semantic segmentation has 
not yet been considered. We address this limitation by developing and 
applying new robustness analysis methods for several segmentation neu- 
ral network architectures, specifically by addressing reachability anal- 
ysis of up-sampling layers, such as transposed convolution and dilated 
convolution. We consider several definitions of robustness for segmenta- 
tion, such as the percentage of pixels in the output that can be proven 
robust under different adversarial perturbations, and a robust variant of 
intersection-over-union (IoU), the typical performance evaluation mea- 
sure for segmentation tasks. Our approach is based on a new relaxed 
reachability method, allowing users to select the percentage of a num- 
ber of linear programming problems (LPs) to solve when constructing 
the reachable set, through a relaxation factor percentage. The approach 
is implemented within NNV, then applied and evaluated on segmenta- 
tion datasets, such as a multi-digit variant of MNIST known as M2NIST. 
Thorough experiments show that by using transposed convolution for 
up-sampling and average-pooling for down-sampling, combined with 
minimizing the number of ReLU layers in the SSNs, we can obtain SSNs that 
not only have high accuracy (IoU) but are also more robust to adversarial 
attacks and amenable to verification. Additionally, using our new 
relaxed reachability method, we can significantly reduce the verification 
time for neural networks whose ReLU layers dominate the total analysis 
time, even in classification tasks. 
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1 Introduction 


Image segmentation is the process of partitioning an image into multiple portions, 
or segments, which are sets of pixels, and in short is referred to as 
segmentation [30]. Segmentation has broad applications, ranging from perception in 
autonomous cyber-physical systems (e.g., identifying pedestrians, lanes, vehicles, 
etc. in images) to medical imaging (e.g., identifying tumors, measuring tissue, 
etc. in X-rays and other medical scans) [31]. Semantic segmentation additionally 
classifies each pixel into a class from a set of classes, and hence, can be viewed as 
a generalization of image classification, the robustness of which has been studied 
deeply in recent years. 

State-of-the-art segmentation approaches typically rely on neural networks, 
known as semantic segmentation networks (SSNs). Typically SSN architectures 
take an image as input and are composed of two major portions: a sequence 
of down-sampling layers to extract features from the input image into a latent 
space, followed by another sequence of up-sampling layers, which in essence 
map the features (roughly corresponding to the classes) from the latent space 
to the image’s pixels, such that each pixel is associated with a class. However, 
just as neural networks for image classification are well-known to be vulnerable 
to adversarial perturbations, so too are SSNs [45]. Although deep neural 
network (DNN) verification is emerging into an established research area with 
many tools and techniques proposed to verify safety and robustness specifications 
of DNNs [22,43] and neural network controlled systems [15,17,34,37], most 
state-of-the-art verification techniques for robustness verification of DNNs focus on 
variants of classification,¹ frequently for images [1,5,7,11,19,24,26,29,32,33,46]. 

To our knowledge, there are no existing methods that can verify robustness 
of SSNs, which perform a more complex task than image classification, as the 
output space dimensionality is (typically) of the same order of size as that of 
the input space (e.g., the output is an image with the width and height of the 
input image, but with identified classes in the output instead of color bit depth; 
see Figs. 5 and 7 for examples). We review some existing testing-based robustness 
evaluation methods in our related work section. 


Overview and Contributions. In this paper, we present the first formal approach 
for verifying SSN robustness using reachability analysis. Our approach’s central 
idea is, if an input image is attacked (perturbed) with some bounded distur- 
bance, we construct a reachable output set that contains all possible classes for 
each pixel. From the reachable output set, we can formally guarantee an SSN’s 
robustness at the pixel-level, i.e., each pixel is provably classified correctly. Our 
approach focuses on two effective SSN architectures, namely dilated CNNs 
and transposed CNNs, which, to our knowledge, are not supported by any other 
existing neural network verification approach. We evaluate our approach on a 
set of SSNs trained with different architectures on the MNIST [21] and M2NIST 
data sets, the latter of which is a multi-digit variant of MNIST suitable for 
segmentation evaluation. Additionally, we define and evaluate several metrics for 
robustness, as the robustness evaluation is more sophisticated for segmentation. 


1 The ACAS-Xu benchmarks [18] common in neural network verification can be viewed 
as a form of classifier: the networks produce advisories (weak left, etc.), which in 
essence are classes, but do not have images as their inputs. 

Our reachability-based approach builds on ImageStars, which are an efficient 
data structure for verifying convolutional neural networks (CNNs) [33], to con- 
struct the input set and compute the reachable set layer-by-layer throughout 
the SSN. The ImageStar approach offers both exact and approximate reachabil- 
ity schemes for analyzing the robustness of CNNs. Although the approximate 
scheme obtains a tighter reachable set in comparison with the zonotope [28] and 
new polytope methods [29] by using optimized ranges, in practice, we do not 
need a tight reachable set in many cases. Indeed, we only need a “tight enough” 
reachable set to verify a property. Therefore, it is reasonable to let users have the 
freedom to choose an appropriate level of relaxation in constructing the reachable 
set for their applications. More relaxation comes with a coarser reachable 
set and vice versa. To fulfill this need, we also present a new relaxed ImageStar 
approach to allow users to choose a specific relaxation level defined by a 
relaxation factor (RF) percentage when constructing the reachable set for their 
applications. This relaxed reachability method can help reduce the verification 
time of SSNs significantly (up to 99%) in some cases. 

In summary, the main contributions of this paper are: 1) the first formal 
approach for robustness verification of SSNs, 2) a new relaxed ImageStar reach- 
ability method, 3) the implementation of the approach in a prototype software 
tool, 4) thorough assessment of these methods on different network architectures, 
and 5) insight on how to train robust SSNs that are amenable to verification. 


2 Preliminaries and Problem Formulation 


2.1 ImageStars 
In this section, we review the ImageStar data structure and its properties [33]. 


Definition 1. An ImageStar $\Theta$ is a tuple $\langle c, V, P, l, u \rangle$ where $c \in \mathbb{R}^{h \times w \times nc}$ is 
the anchor image, $V = \{v_1, v_2, \dots, v_m\}$ is a set of $m$ images in $\mathbb{R}^{h \times w \times nc}$ called 
generator images, $P : \mathbb{R}^m \to \{\top, \bot\}$ is a predicate, $l$ and $u$ are the lower bound 
and upper bound vectors of the predicate variables, and $h, w, nc$ are the height, 
width, and number of channels of the images, respectively. The generator images 
are arranged to form the ImageStar's $h \times w \times nc \times m$ basis array. The set of 
images represented by the ImageStar is given as: 


$$[\![\Theta]\!] = \{x \mid x = c + \sum_{i=1}^{m} \alpha_i v_i \text{ such that } P(\alpha_1, \dots, \alpha_m) = \top,\ l_i \le \alpha_i \le u_i\}.$$


We may refer to both the tuple $\Theta$ and the set of states $[\![\Theta]\!]$ as $\Theta$. In this work, we 
restrict the predicates to be a conjunction of linear constraints, $P(\alpha) \triangleq C\alpha \le d$ 
where, for $p$ linear constraints, $C \in \mathbb{R}^{p \times m}$, $\alpha$ is the vector of $m$ variables, i.e., 
$\alpha = [\alpha_1, \dots, \alpha_m]^\top$, and $d \in \mathbb{R}^{p \times 1}$. An ImageStar is the empty set if and only if 
$P(\alpha)$ subject to $l \le \alpha \le u$ is empty. 
Lemma 1 (Affine mapping of an ImageStar). An affine mapping of an 
ImageStar $\Theta = \langle c, V, P, l, u \rangle$ with a scale factor $\gamma$ and an offset image $\beta$ is 
another ImageStar $\Theta' = \langle c', V', P', l', u' \rangle$ in which the new anchor, generators 
and predicate are as follows: 


$$c' = \gamma \cdot c + \beta, \quad V' = \gamma \cdot V, \quad P' \equiv P, \quad l' = l, \quad u' = u.$$


Note that the scale factor $\gamma$ can be a scalar or a vector containing scalar scale 
factors in which each factor is used to scale one color channel in the ImageStar. 
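
As a concrete reading of Definition 1 and Lemma 1, the sketch below stores an ImageStar as arrays and applies an affine map to it. It is an illustrative data layout only, not the NNV implementation: the class name, field names, and the helper `affine_map` are hypothetical, and the predicate is stored directly as the pair $(C, d)$.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ImageStar:
    c: np.ndarray   # anchor image, shape (h, w, nc)
    V: np.ndarray   # generator images stacked as a basis array, shape (h, w, nc, m)
    C: np.ndarray   # predicate matrix, shape (p, m)
    d: np.ndarray   # predicate vector, shape (p,)
    l: np.ndarray   # lower bounds of the predicate variables, shape (m,)
    u: np.ndarray   # upper bounds of the predicate variables, shape (m,)

    def contains_coeffs(self, alpha):
        """True if alpha satisfies C alpha <= d and l <= alpha <= u."""
        return bool(np.all(self.C @ alpha <= self.d + 1e-12)
                    and np.all(self.l <= alpha) and np.all(alpha <= self.u))

    def image(self, alpha):
        """The image c + sum_i alpha_i v_i for a given coefficient vector."""
        return self.c + self.V @ alpha


def affine_map(star, gamma, beta):
    """Lemma 1: scale by gamma (scalar or per-channel vector) and add the offset beta."""
    gamma = np.asarray(gamma, dtype=float)
    return ImageStar(c=gamma * star.c + beta,
                     V=gamma[..., None] * star.V,   # scale every generator the same way
                     C=star.C, d=star.d, l=star.l, u=star.u)
```

Because the predicate and bounds are untouched, every image of the mapped star is the affine image of the corresponding image of the original star, with the same coefficient vector.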


2.2 Range of a Specific Input in an ImageStar 


We slightly alter the original definition of an ImageStar [33] by introducing 
lower bound and upper bound vectors to the predicate variables. Specifically, if 
we want to find the range of an input $x(i,j,k)$ (where $1 \le i \le h$, $1 \le j \le w$, 
$1 \le k \le nc$) in an ImageStar $\Theta$, we need to solve the following LP problems. 


$$x_{min} = \min\Big(c(i,j,k) + \sum_{p=1}^{m} \alpha_p v_p(i,j,k)\Big) \ \text{s.t.}\ C\alpha \le d,\ l \le \alpha \le u, \tag{1}$$

$$x_{max} = \max\Big(c(i,j,k) + \sum_{p=1}^{m} \alpha_p v_p(i,j,k)\Big) \ \text{s.t.}\ C\alpha \le d,\ l \le \alpha \le u. \tag{2}$$


However, if we only want to estimate roughly the range of the neuron without 
solving the LP optimization problem, we can compute the estimated range 
quickly as follows. 

$$x^{est}_{min} = c(i,j,k) + \sum_{p=1}^{m} l_p \max(v_p(i,j,k), 0) + \sum_{q=1}^{m} u_q \min(v_q(i,j,k), 0), \tag{3}$$

$$x^{est}_{max} = c(i,j,k) + \sum_{p=1}^{m} u_p \max(v_p(i,j,k), 0) + \sum_{q=1}^{m} l_q \min(v_q(i,j,k), 0). \tag{4}$$
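
The estimated ranges of Eqs. (3) and (4) can be evaluated for all entries of the image at once, as the following sketch shows. The function name `estimate_ranges` and the array layout (generators stacked along the last axis) are assumptions made for this illustration. Since the linear constraints $C\alpha \le d$ are ignored, the result encloses the exact LP range of Eqs. (1) and (2) but may be looser.

```python
import numpy as np


def estimate_ranges(c, V, l, u):
    """Estimated lower and upper bounds of every entry x(i, j, k), following Eqs. (3)-(4).

    c: anchor image, shape (h, w, nc)
    V: generator images, shape (h, w, nc, m)
    l, u: lower and upper bounds of the predicate variables, shape (m,)
    """
    V_pos = np.maximum(V, 0.0)   # max(v_p(i, j, k), 0)
    V_neg = np.minimum(V, 0.0)   # min(v_p(i, j, k), 0)
    x_min = c + V_pos @ l + V_neg @ u
    x_max = c + V_pos @ u + V_neg @ l
    return x_min, x_max
```

For example, with a single entry, generators $v_1 = 1$, $v_2 = -2$, and bounds $l = (-1, 0)$, $u = (1, 3)$, the estimate is $[-7, 1]$.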


2.3 Semantic Segmentation Networks and Reachability 


Definition 2. A semantic segmentation network (SSN) $f$ is a nonlinear function 
that maps each pixel $x(i,j)$ of a multichannel input image $x$ to a target class 
$y(i,j)$ from a set of classes $\mathcal{L} = \{1, 2, \dots, L\}$: 

$$f : x \in \mathbb{R}^{h \times w \times nc} \to y \in \mathcal{L}^{h \times w}, \qquad x(i,j) \mapsto y(i,j) \in \mathcal{L}, \tag{5}$$

where $h, w, nc$ are the height, width, and number of channels of the input image, 
respectively, and $(i,j) \in \{1, \dots, h\} \times \{1, \dots, w\}$ are the pixel height and width 
indices, respectively. 


Definition 3. Reachability analysis (or shortly, Reach) of a SSN $f$ on an 
ImageStar input set $I$ is the process of computing all possible classes corresponding 
to every pixel in all input images $x$ in the ImageStar input set $I$: 

$$Reach(f, I) : I \to R_f. \tag{6}$$

We call $R_f(I)$ the pixel-class reachable set of the SSN corresponding to the 
input set $I$ (or just $R_f$ when $I$ is clear from context), in which each pixel-class 
$pc(i,j) \in R_f$ at each pixel $(i,j) \in \{1, \dots, h\} \times \{1, \dots, w\}$ may contain more than 
one class, i.e., $pc(i,j) = \{l_1, \dots, l_m\} \subseteq \mathcal{L}$, for $L \ge m \ge 1$. 


2.4 Adversarial Attacks and Robustness 


Definition 4. An adversarial attack is where a set of $n$ noise images $x^{noise} 
= [x_1^{noise}, \dots, x_n^{noise}]$ and corresponding coefficient vector $\epsilon = [\epsilon_1, \dots, \epsilon_n]^\top$ are 
added to input image $x$ to change the classification result of a network. 

Mathematically, an adversarial attack is a linear parameterized function 
$g_{\epsilon, x^{noise}}(\cdot)$ that takes an image as an input and produces the corresponding 
adversarial image. 


$$x^{adv} = g_{\epsilon, x^{noise}}(x) = x + \sum_{i=1}^{n} \epsilon_i x_i^{noise}. \tag{7}$$


In this paper, we focus on the robustness analysis of SSNs under adversarial 
attacks. We refer readers to [45] for a survey of state-of-the-art attack and defense 
approaches, mostly for classification. 


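Definition 5. An unknown, bounded adversarial attack (UBAA) is an adversarial 
attack where the value of the coefficient vector $\epsilon$ is unknown but bounded 
in a range $[\underline{\epsilon}, \bar{\epsilon}]$, i.e., $\underline{\epsilon}_i \le \epsilon_i \le \bar{\epsilon}_i$. An UBAA can be defined formally as a tuple 
$A = \langle \underline{\epsilon}, \bar{\epsilon}, x^{noise} \rangle$. 


Proposition 1 (UBAA as an ImageStar). Applying an UBAA $A = \langle \underline{\epsilon}, \bar{\epsilon}, 
x^{noise} \rangle$ on an image $x$ creates a set of images, which can be represented as an 
ImageStar $I = \langle c = x,\ V = x^{noise},\ P(\alpha) \triangleq P(\epsilon) \triangleq \underline{\epsilon} \le \epsilon \le \bar{\epsilon} \rangle$. 


Definition 6. Given a SSN $f$ and an input image $x$, a pixel $x(i,j) \in x$ is called 
robust to an UBAA $A$ if and only if: $\forall g_{\epsilon, x^{noise}} \in A$, $f(x^{adv}(i,j)) = f(x(i,j))$, 
where $x^{adv}(i,j) \in x^{adv} = g_{\epsilon, x^{noise}}(x)$. If $\exists g_{\epsilon, x^{noise}} \in A$ such that $f(x^{adv}(i,j)) \ne 
f(x(i,j))$, the pixel $x(i,j)$ is called non-robust. 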


Definition 7. The robustness value (RV) of a SSN corresponding to an UBAA 
applied to an input image is defined as $RV = \frac{N_{robust}}{N_{pixels}} \times 100\%$, where $N_{robust}$ is 
the total number of robust pixels under the attack, and $N_{pixels} = h \cdot w$ is the total 
number of pixels of the input image. 


Definition 8. The robustness sensitivity (RS) of a SSN corresponding to an 
UBAA applied to an input image is defined as $RS = \frac{N_{nonrobust} + N_{unknown}}{N_{attackedpixels}}$, where 
$N_{nonrobust}$ is the total number of non-robust pixels under the attack, $N_{unknown}$ 
is the total number of pixels whose robustness is unknown (may or may not be 
robust), and $N_{attackedpixels}$ is the total number of attacked pixels of the input 
image. 
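
Definitions 7 and 8 can be evaluated from a pixel-class reachable set roughly as sketched below. The classification rule is an assumption made for illustration, not a statement of how NNV decides: a pixel is counted as robust if its reachable classes are exactly the unperturbed label, as non-robust if that label is not reachable at all, and as unknown otherwise. The function name and the flat list-of-sets representation are hypothetical.

```python
def robustness_metrics(reach, labels, n_attacked):
    """Robustness value (in percent) and robustness sensitivity for one image.

    reach: list with one set of reachable classes per pixel
    labels: list with the class of each pixel for the unperturbed image
    n_attacked: number of attacked pixels of the input image
    """
    n_robust = n_nonrobust = n_unknown = 0
    for classes, label in zip(reach, labels):
        if classes == {label}:
            n_robust += 1            # only the original class is reachable
        elif label not in classes:
            n_nonrobust += 1         # the original class is never reachable
        else:
            n_unknown += 1           # the original class and others are reachable
    rv = 100.0 * n_robust / len(labels)
    rs = (n_nonrobust + n_unknown) / n_attacked
    return rv, rs
```

With four pixels, of which two are robust, one is unknown and one is non-robust, and two attacked pixels, this gives $RV = 50\%$ and $RS = 1$.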


Definition 9. The robust IoU (Intersection-over-Union) ($R_{IoU}$) of a SSN corresponding 
to an UBAA applied to an input image is defined as the average 
IoU of all labels that are robust under the attack. Let $x$ be a segmentation 
ground-truth image, $y$ be the verified segmentation image under the attack, and 
$IoU_p$ be the IoU (also known as Jaccard index) of the $p$-th label in the label images 
$x$ and $y$, then the robust IoU of the SSN is computed by: 


$$R_{IoU} = \frac{\sum_{p=1}^{L} IoU_p}{L}. \tag{8}$$


The robust IoU definition is quite similar to traditional IoU, which is a core 
metric to evaluate the accuracy in training SSNs. However, instead of assessing 
the accuracy, we use the robust IoU concept in combination with the robustness 
value and robustness sensitivity as core metrics to evaluate the robustness of a 
SSN under adversarial attack in the verification context. 


2.5 Robustness Verification Problem Formulation 
We consider two robustness verification problems. 


Problem 1. Given a SSN $f$, an image $x$, and an UBAA $A$, prove for every pixel 
$x(i,j) \in x$ that $x(i,j)$ is robust or non-robust to the attack $A$. 


Problem 2. Given a SSN $f$, a set of $N$ test images $\{x_1, \dots, x_N\}$, and an UBAA 
$A$, compute the average robustness value $RV$, the average robustness sensitivity 
$RS$, and the average robust IoU $R_{IoU}$ of the SSN (corresponding to $A$). 


The core step in solving these problems is to prove the robustness of a SSN $f$ 
under an UBAA $A$ at the pixel-level, i.e., Problem 1, which can be solved using 
reachability analysis computing the "pixel-class reachable set" $R_f = Reach(f, I)$ 
that contains all possible classes of every pixel in the input set $I$ constructed by 
applying the attack $A$ on an image $x$ (Proposition 1). Next, we investigate 
reachability for the up-sampling layers, including transposed and dilated convolution, 
and a new relaxed ImageStar reachability method for the ReLU and pixel-classification 
layers. We note that the softmax layer can be neglected in the analysis [33]. 


3 Reachability of SSNs Using Relaxed ImageStars 


In this section, we build on the original ImageStar method to develop reach- 
ability analysis for the transposed convolution and dilated convolution layers, 
and propose a new relaxed ImageStar reachability method for the ReLU and 
pixel-classification layers. The reachability algorithms for other layers can be 
handled using existing methods, such as those in [33]. Thus, we highlight han- 
dling the up-sampling layers, which requires overcoming significant challenges, 
and has not previously been done. Handling up-sampling layers is necessary for 
SSN robustness verification. 


3.1 Reachability of a Transposed (Dilated) Convolutional Layer 


Transposed (dilated) convolutions are frequently used for up-sampling in image
segmentation applications to generate an output feature map that has a spatial
dimension greater than that of the input feature map. A transposed convolution
operation consists of four main steps, depicted in Fig. 1, and is defined by its
kernel size $k$, padding $p$, and stride $s$. A dilated convolution operation is defined
by its kernel size $k$, padding $p$, stride $s$, and dilation factor $d$.


Fig. 1. An example of a transposed convolution operation. The figure shows an input,
a kernel, and an output, with four steps: (1) calculate the parameters $z$ and $p'$;
(2) insert $z$ zeros between the rows and the columns; (3) add $p'$ zeros around the
image; (4) slide the kernel one pixel at a time across the image.


Fig. 2. Example 1. The figure poses the question $ReLU(I) = ?$ for the ImageStar
$I = c + \alpha_1 v_1 + \alpha_2 v_2$, with $\alpha = (\alpha_1, \alpha_2)^T$ and predicate
$P \equiv C\alpha \le d,\ l \le \alpha \le u$.


Lemma 2. The reachable set of a transposed (dilated) convolutional layer with
an ImageStar input set $I = (c, V, P)$ is another ImageStar, specifically $I' =
(c', V', P)$, where $c' = TConv(c)$ ($c' = DConv(c)$) is the transposed (dilated)
convolution operation applied to the anchor image, $V' = \{v'_1, \ldots, v'_m\}$, and $v'_i =
TConvZeroBias(v_i)$ ($v'_i = DConvZeroBias(v_i)$) is the transposed (dilated)
convolution operation with zero bias applied to the generator images, i.e., using only
the weights of the layer. Each of these is an affine operation (see [30] for details),
and, as shown in Lemma 1, ImageStars are closed under affine operations.²
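
To make Lemma 2 concrete, the following Python sketch implements a small transposed convolution along the four steps of Fig. 1 and checks numerically that mapping the anchor image with bias and the generator images without bias reproduces the transposed convolution of a concrete image in the set. All function names are hypothetical. The kernel flip and the parameter formulas $z = s - 1$ and $p' = k - p - 1$ are the usual textbook conventions and are assumptions here, as are the example values.

```python
def correlate_valid(img, kernel, bias=0.0):
    """Slide the kernel one pixel at a time over img (no padding) and add the bias."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[bias + sum(img[r + a][c + b] * kernel[a][b]
                        for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def tconv(img, kernel, stride=1, padding=0, bias=0.0):
    """Transposed convolution with a square kernel (requires padding <= k - 1)."""
    k = len(kernel)
    z, p_prime = stride - 1, k - padding - 1      # step 1: derived parameters
    step = z + 1                                  # spacing after inserting z zeros
    h, w = len(img), len(img[0])
    size_h = (h - 1) * step + 1 + 2 * p_prime
    size_w = (w - 1) * step + 1 + 2 * p_prime
    buf = [[0.0] * size_w for _ in range(size_h)]
    for r in range(h):                            # steps 2 and 3: zero insertion
        for c in range(w):                        # and zero border in one buffer
            buf[p_prime + r * step][p_prime + c * step] = img[r][c]
    flipped = [row[::-1] for row in kernel[::-1]]  # assumed convention
    return correlate_valid(buf, flipped, bias)    # step 4


def evaluate(c, gens, alpha):
    """Concrete image c + sum_i alpha_i * v_i of an ImageStar."""
    return [[c[r][q] + sum(a * v[r][q] for a, v in zip(alpha, gens))
             for q in range(len(c[0]))]
            for r in range(len(c))]


# Hypothetical ImageStar with a 2 x 2 anchor image and two generator images.
c = [[1.0, 2.0], [3.0, 4.0]]
gens = [[[0.0, 1.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, -1.0]]]
kernel = [[1.0, 2.0], [0.0, -1.0]]
stride, padding, bias = 2, 0, 0.5

c_new = tconv(c, kernel, stride, padding, bias)                  # anchor: with bias
gens_new = [tconv(v, kernel, stride, padding) for v in gens]     # generators: zero bias

alpha = (0.3, -0.7)
direct = tconv(evaluate(c, gens, alpha), kernel, stride, padding, bias)
via_star = evaluate(c_new, gens_new, alpha)
max_gap = max(abs(a - b) for ra, rb in zip(direct, via_star) for a, b in zip(ra, rb))
print(len(c_new), len(c_new[0]), max_gap < 1e-9)
```

The predicate is untouched by the layer, which is why only the anchor and generator images are transformed.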


3.2 Relaxed Reachability of a ReLU Layer 


In this section, we present the relaxed ImageStar reachability of a ReLU layer.
Like the original approximate reachability method [33], the relaxed ImageStar
approach computes an overapproximate reachable set of a ReLU layer. However,
it allows users to construct a “tight enough” reachable set, sufficient to prove
properties for their applications, via a user-specified relaxation factor between
0% and 100% that reduces verification time. In this paper, we focus on this
process for ReLU layers. We use a small example, depicted in Fig. 2, to illustrate
the reachability of a ReLU layer using the relaxed ImageStar method. In this
example, we have a 2 × 2 (4 neurons) ImageStar input set $I$ with the anchor
image $c$ and two generator images $v_1$ and $v_2$, and we want to compute an
overapproximation of $ReLU(I)$. To do that, we apply the triangle overapproximation
rule [10,36], stated in the following lemma, for the ReLU activation function at each
neuron of the input set.


2 In most neural network frameworks, transposed and dilated convolution are
implemented as convolution with particular choices of padding, stride, and dilation factor,
as illustrated in Fig. 1 for transposed convolution, which is well known to be affine.


Lemma 3. For any input $x \in [l, u]$, the output set $Y = \{y \mid y = ReLU(x)\}$
satisfies: (1) if $l \ge 0$, then $y = x$; (2) if $u \le 0$, then $y = 0$; or (3) if $l < 0$ and
$u > 0$, then $Y \subset \bar{Y} = \{y \mid y \ge 0,\ y \le \frac{u(x - l)}{u - l},\ y \ge x\}$.
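
Case (3) of Lemma 3 can be checked numerically. The sketch below samples inputs in a hypothetical interval $[l, u] = [-1, 1.5]$ and confirms that every point $(x, ReLU(x))$ satisfies the three triangle constraints; the helper names and the tolerance are assumptions made for illustration.

```python
def relu(x):
    return max(0.0, x)


def in_triangle(x, y, l, u, tol=1e-9):
    """Check the three constraints of case (3) of Lemma 3, valid for l < 0 < u."""
    upper = u * (x - l) / (u - l)   # line through (l, 0) and (u, u)
    return y >= -tol and y >= x - tol and y <= upper + tol


l, u = -1.0, 1.5
samples = [l + (u - l) * i / 100 for i in range(101)]
print(all(in_triangle(x, relu(x), l, u) for x in samples))

# A point above the upper line is outside the overapproximation.
print(in_triangle(0.0, 1.0, l, u))
```

The triangle is an overapproximation: it contains the ReLU graph but also points such as $(0, 0.5)$ that ReLU never produces.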


Using the predicate variables’ bounds, we can quickly estimate the ranges of
all neurons in the ImageStar set in Fig. 2 without solving any linear programming
(LP) optimization problems (by using Eq. 3). From the estimated ranges, we see that
$ReLU(n_{21}) = 0$ ($n_{21} \le 0$) and $ReLU(n_{22}) = 2 - \alpha_1 + \alpha_2$ ($n_{22} \ge 0$). Therefore, to
overapproximate $ReLU(I)$, we need only apply the overapproximation rule to
neurons $n_{11}$ and $n_{12}$, which is where the user-defined relaxation can be applied.
In the original approximate reachability approach [33], we use the exact ranges
to construct the triangle overapproximation of the ReLU activation function,
which requires solving 4 LPs to find the exact ranges of $n_{11}$ and $n_{12}$, which
are $[-0.5, 1.5]$ and $[-1, 1]$, respectively, in this example. Now, if users want to
reduce the number of LPs solved in constructing the overapproximate reachable
set to speed up verification, which LPs should be solved to construct a
sufficiently tight overapproximate reachable set? For example, if the users want
to relax 50% of the LPs for Example 1, then only $4 - (50\% \times 4) = 2$ LPs are
solved to construct an overapproximate reachable set. So, which two LPs should
be chosen?
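
The LP-free range estimate mentioned above can be sketched as follows. Eq. 3 is not reproduced in this section, so the snippet assumes the usual interval bound that uses only the box $l \le \alpha \le u$ and ignores $C\alpha \le d$. The neuron $n_{22} = 2 - \alpha_1 + \alpha_2$ is taken from the text, while the predicate bounds $-1 \le \alpha_i \le 1$ and the function name are hypothetical.

```python
def estimate_range(center, gens, lower, upper):
    """Interval estimate of center + sum_i alpha_i * gens[i] for lower <= alpha <= upper."""
    lo = hi = center
    for v, l, u in zip(gens, lower, upper):
        lo += min(v * l, v * u)   # smallest contribution of this predicate variable
        hi += max(v * l, v * u)   # largest contribution of this predicate variable
    return lo, hi


# n22 = 2 - alpha_1 + alpha_2 with assumed bounds -1 <= alpha_i <= 1.
lo, hi = estimate_range(2.0, gens=[-1.0, 1.0], lower=[-1.0, -1.0], upper=[1.0, 1.0])
print(lo, hi)
```

Under these assumed bounds the estimate is $[0, 4]$, so $n_{22} \ge 0$ and the ReLU leaves the neuron unchanged, matching the text. Because the linear constraints are ignored, such an estimate can be looser than the exact range obtained from LPs.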

The answer is found by combining the exact ranges obtained by solving LPs 
and the estimated ranges to construct the overapproximate reachable set. This 
can be done using one of the following heuristic approaches. These approaches
select which neurons and their corresponding lower (upper) bounds should be 
obtained exactly to construct an as-tight-as-possible overapproximate reachable 
set with a given allowable number of LPs. Some of these heuristic approaches 
are based on the estimated ranges information. 


3.2.1 Randomly Relaxed Reachability 

This approach randomly selects some LPs in the LPs pool to solve to obtain the 
lower (upper) bounds for some (random) neurons. For Example 1, the LPs pool 
is as follows. 


$$LP_{pool} = \{\min(n_{11}),\ \max(n_{11}),\ \min(n_{12}),\ \max(n_{12})\}, \quad \text{subject to: } P \equiv C\alpha \le d,\ l \le \alpha \le u.$$


Fig. 3. Overapproximation areas at neurons $n_{11}$ and $n_{12}$ using estimated ranges.


If users relax 50% of the LPs, then the randomly relaxed reachability algorithm
selects two LPs in the LP pool at random to solve, and then combines the obtained
lower (upper) bounds with the estimated ranges to construct an overapproximate
reachable set using the triangle overapproximation rule, i.e., Lemma 3.

From Fig. 2, we can see that the estimated lower bounds of neurons $n_{11}$ and
$n_{12}$ are the same as the exact ones. Therefore, if the randomly relaxed
reachability algorithm selects $\min(n_{11})$ and $\min(n_{12})$ to solve, the final ranges used for
constructing the reachable set exactly match the estimated ranges. This means
solving $\min(n_{11})$ and $\min(n_{12})$ wastes time and does not reduce the
conservativeness of the overapproximate reachable set, as no tighter ranges are obtained.
In another case, if the algorithm selects $\max(n_{11})$ and $\max(n_{12})$, then we can
obtain the exact ranges of the two neurons by solving only two LPs (instead of four)
by combining the estimated lower bounds, i.e., $-0.5$ for $n_{11}$ and $-1$
for $n_{12}$, with the optimized upper bounds, i.e., $1.5$ for $n_{11}$ and $1$ for $n_{12}$. In this
case, the randomly relaxed algorithm can obtain the tightest overapproximate
reachable set by solving only 50% of the LPs.
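
The two cases above can be replayed with a short Python sketch. The estimated and exact ranges are the ones quoted in the text for Example 1; the exact ranges stand in for LP solutions, since no LP solver is invoked here. The function names and the rounding of the LP budget are assumptions.

```python
import random

ESTIMATED = {"n11": (-0.5, 2.5), "n12": (-1.0, 1.5)}  # estimated ranges from the text
EXACT = {"n11": (-0.5, 1.5), "n12": (-1.0, 1.0)}      # exact ranges, stand-in for LP results
POOL = [("min", "n11"), ("max", "n11"), ("min", "n12"), ("max", "n12")]


def lps_to_solve(pool_size, relaxation_factor):
    """Number of LPs still solved, e.g. 4 - (50% x 4) = 2 (rounding is an assumption)."""
    return pool_size - round(relaxation_factor * pool_size)


def random_selection(pool, relaxation_factor, seed=None):
    """Randomly relaxed heuristic: pick the LPs to solve uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(pool, lps_to_solve(len(pool), relaxation_factor))


def combine(selected):
    """Use the exact bound where its LP was selected, the estimated bound elsewhere."""
    ranges = {}
    for neuron, (lo, hi) in ESTIMATED.items():
        if ("min", neuron) in selected:
            lo = EXACT[neuron][0]
        if ("max", neuron) in selected:
            hi = EXACT[neuron][1]
        ranges[neuron] = (lo, hi)
    return ranges


# Unlucky draw: both minimisation LPs, no tightening at all.
print(combine([("min", "n11"), ("min", "n12")]) == ESTIMATED)
# Lucky draw: both maximisation LPs, exact ranges with half of the LPs.
print(combine([("max", "n11"), ("max", "n12")]) == EXACT)
print(combine(random_selection(POOL, 0.5, seed=0)))
```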


3.2.2 Area-Based Relaxed Reachability 

The area-based relaxed reachability approach finds and optimizes the ranges of
the neurons with the potentially largest triangle overapproximation areas. Figure 3
illustrates the areas of the triangle overapproximation at neurons $n_{11}$ and $n_{12}$
using the estimated ranges. We see that the overapproximation area of $n_{12}$ ($S_{n_{12}} =
0.75$) is larger than that of $n_{11}$ ($S_{n_{11}} = 0.625$). Therefore, if users relax 50%
of the LPs, the area-based relaxed reachability algorithm will use two LPs to
optimize the range of neuron $n_{12}$, i.e., solving $\min(n_{12})$ and $\max(n_{12})$. With this
optimized range, the overapproximation area of neuron $n_{12}$ reduces from
$S_{n_{12}} = 0.75$ to $S_{n_{12}} = 0.5$. If users relax 25% of the LPs, so that three LPs are
solved, then the algorithm will use two LPs to optimize the range of neuron $n_{12}$ and
one LP to optimize the upper bound of neuron $n_{11}$, because $u_{n_{11}} = 2.5 > |l_{n_{11}}| = 0.5$.
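
The areas quoted above follow from the geometry of Lemma 3: for $l < 0 < u$ the triangle has vertices $(l, 0)$, $(0, 0)$, and $(u, u)$, so its area is $-l \cdot u / 2$. The following sketch, with hypothetical function names, reproduces the numbers of Example 1 and the neuron ordering used by the area-based heuristic.

```python
def triangle_area(l, u):
    """Area of the triangle relaxation of ReLU on [l, u]; zero when no relaxation is needed."""
    if l >= 0 or u <= 0:
        return 0.0
    return -l * u / 2.0


def area_based_order(ranges):
    """Neurons sorted by decreasing estimated overapproximation area."""
    return sorted(ranges, key=lambda n: triangle_area(*ranges[n]), reverse=True)


estimated = {"n11": (-0.5, 2.5), "n12": (-1.0, 1.5)}  # estimated ranges from the text
exact = {"n11": (-0.5, 1.5), "n12": (-1.0, 1.0)}      # exact ranges from the text

for name in estimated:
    print(name, triangle_area(*estimated[name]), triangle_area(*exact[name]))
print(area_based_order(estimated))
```

With the estimated ranges, $n_{12}$ has the larger area (0.75 vs. 0.625), so its two LPs are solved first and its area drops to 0.5.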


3.2.3 Range-Based Relaxed Reachability 
The range-based relaxed reachability approach finds the neurons with the 
potentially widest ranges to optimize their ranges. For Example 1, unlike the 




area-based approach, the range-based approach will use two LPs to optimize the
range of neuron $n_{11}$, i.e., solving $\min(n_{11})$ and $\max(n_{11})$, whose estimated range
(ER) is widest ($ER_{n_{11}} = |u_{n_{11}} - l_{n_{11}}| = |2.5 - (-0.5)| = 3 > ER_{n_{12}} = 2.5$).
After optimizing the range of neuron $n_{11}$, the overapproximation area at this
neuron reduces from $S_{n_{11}} = 0.625$ to $S_{n_{11}} = 0.375$. The improvement in terms
of overapproximation area reduction of the range-based method is equivalent to
that of the above area-based approach in this case, i.e., $\Delta S_{n_{11}} = \Delta S_{n_{12}} = 0.25$.


3.2.4 Bound-Based Relaxed Reachability 

The bound-based relaxed reachability approach finds the neurons with the
potentially largest (lower or upper) bounds to optimize their bounds. For Example
1, the algorithm will use two LPs to optimize the upper bounds of neurons
$n_{11}$ and $n_{12}$, i.e., solving $\max(n_{11})$ and $\max(n_{12})$, because their estimated upper
bounds are the ones with the largest absolute values: $|u_{n_{11}}| = 2.5 > |u_{n_{12}}| =
1.5 > |l_{n_{12}}| = 1 > |l_{n_{11}}| = 0.5$. After optimizing these upper bounds, the
overapproximation areas at neurons $n_{11}$ and $n_{12}$ reduce to 0.375 and 0.5, respectively.
In this case, we can see that the bound-based relaxed approach is the best
approach compared to the others since it reduces the overapproximation errors at
both neurons $n_{11}$ and $n_{12}$, reducing the overapproximation areas by
$\Delta S_{n_{11}} = \Delta S_{n_{12}} = 0.25$. It is worth noting that the obtained overapproximate
reachable set is the same as the one obtained by the original approximate ImageStar
reachability because the estimated and optimized lower bounds are the same.
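
The range-based and bound-based selections of Sects. 3.2.3 and 3.2.4 reduce to two different sort keys over the estimated ranges. The sketch below uses the estimated ranges of Example 1 from the text; the function names are hypothetical and ties are not handled in any special way.

```python
def range_based_order(ranges):
    """Neurons sorted by decreasing width of their estimated range."""
    return sorted(ranges, key=lambda n: ranges[n][1] - ranges[n][0], reverse=True)


def bound_based_order(ranges):
    """Individual LPs sorted by decreasing absolute value of the estimated bound."""
    bounds = [(abs(b), kind, n)
              for n, (l, u) in ranges.items()
              for kind, b in (("min", l), ("max", u))]
    bounds.sort(key=lambda t: t[0], reverse=True)
    return [(kind, n) for _, kind, n in bounds]


estimated = {"n11": (-0.5, 2.5), "n12": (-1.0, 1.5)}

print(range_based_order(estimated))      # n11 first: width 3 vs. 2.5
print(bound_based_order(estimated)[:2])  # the two LPs solved at 50% relaxation
```

With a budget of two LPs, the range-based rule spends both on $n_{11}$, while the bound-based rule solves $\max(n_{11})$ and $\max(n_{12})$, which is the choice that tightens both neurons in this example.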


3.3 Reachability of a Pixel-Classification Layer 


The last layer in an SSN is a pixel-classification layer, which assigns a specific
class (label) to each pixel of an input image. Given an $h \times w \times n_c$ input image,
the size of the input $x$ to the pixel-classification layer is $h \times w \times L$, where $L$
is the number of classes (labels) of the network (we neglect the softmax layer
in the analysis). To assign a specific class $l$, $1 \le l \le L$, to a pixel $x(i, j) \in x$,
$1 \le i \le h$, $1 \le j \le w$, the value of the pixel $x(i, j)$ at channel $l$, i.e., $x(i, j, l)$,
needs to be the maximum among the $L$ channels. When the input to the network
is an ImageStar set instead of a single image, the input to the pixel-classification
layer is an $h \times w \times L$ ImageStar set. Depending on the values of the predicate
variables in the input set, a pixel $(i, j)$ in the set may be assigned to more than
one class. For example, if $l_1, \ldots, l_m$ are the cross-channel max-point candidates
of the pixel $x(i, j)$ in the $L$ channels, the pixel-class reachable set of the layer at
the considered pixel is $pc(i, j) = \{l_1, \ldots, l_m\}$. By determining all cross-channel
max-point candidates of all pixels in the input set, we can obtain the pixel-class
reachable set of the layer, which is also the reachable set of the SSN,
$R_f = [pc(i, j)]_{h \times w}$, i.e., the collection of pixel classes at every index $(i, j)$.

Similar to the max-pooling layer [33], determining all cross-channel max-point
candidates of all pixels in the input set can be done via solving linear
programming (LP) optimization problems, which is time-consuming due to the
number of LPs required (or equivalently the size of the LP). To reduce
computation time, we estimate the lower and upper bounds of the ImageStar input to
the layer using only the ranges of the predicate variables. These bounds are then
used to predict all possible cross-channel max-point candidates of all pixels.
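
One way to predict the candidates from estimated bounds is sketched below. The text does not spell out the exact rule, so the test used here is an assumption: channel $l$ is kept as a candidate whenever its upper bound is at least the largest lower bound over all channels, which is a conservative (sound but possibly loose) criterion. Channels are indexed from 0 and all names are hypothetical.

```python
def max_point_candidates(lower, upper):
    """Channels that may attain the cross-channel maximum for one pixel.

    lower[l] and upper[l] are the estimated bounds of the pixel in channel l.
    """
    best_lower = max(lower)
    return [l for l, ub in enumerate(upper) if ub >= best_lower]


def pixel_class_reachable_set(lower, upper):
    """Apply the candidate test to every pixel of h x w x L bound arrays."""
    return [[set(max_point_candidates(lo_px, up_px))
             for lo_px, up_px in zip(lo_row, up_row)]
            for lo_row, up_row in zip(lower, upper)]


# Hypothetical 1 x 2 image with L = 3 channels.
lower = [[[0.1, 0.6, -0.2], [0.8, 0.0, 0.1]]]
upper = [[[0.5, 0.9, 0.7], [0.9, 0.3, 0.2]]]
print(pixel_class_reachable_set(lower, upper))
```

The first pixel keeps two candidate classes because the bounds of channels 1 and 2 overlap, while the second pixel is assigned a single class.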


4 Verification Algorithm 


Our reachability-based verification algorithm for SSNs is presented in Algorithm
4.1. The algorithm takes an SSN $f$, an input image $x$, a UBAA $A$, and a
reachability method (exact or approximate) as inputs, then returns the
pixel-class reachable set $R_f$, the robustness value $RV$, the sensitivity $RS$, and the robust IoU
$R_{IoU}$ of the SSN. The algorithm works as follows. First, it constructs the input
set corresponding to the attack using Proposition 1 (line 2). Then, it computes
the pixel-class reachable set of the SSN using layer-by-layer reachability
analysis (line 3). Using the pixel-class reachable set, it verifies the robustness of
each pixel in the reachable set by comparing its classes with the non-attacked
(ground truth) output segmentation image, i.e., $y = f(x)$. If $R_f(i, j) = y(i, j)$,
the pixel $x(i, j)$ is robust under the attack (line 10). If $R_f(i, j) \ne y(i, j) \wedge y(i, j) \notin
R_f(i, j)$, the pixel $x(i, j)$ is non-robust under the attack (line 12). Otherwise,
the robustness of the pixel $x(i, j)$ is unknown (it may be robust or non-robust)
due to overapproximation. Beyond verifying the robustness of each pixel in the
reachable set, it also counts the numbers of 1) robust pixels $N_{robust}$ (line 10), 2)
non-robust pixels $N_{nonrobust}$ (line 12), and 3) pixels with unknown robustness
$N_{unknown}$ (line 13). Finally, it computes the robustness value, sensitivity, and
robust IoU of the SSN (lines 14, 15, and 16). The robustness of an SSN under a
UBAA should be evaluated on a set of test images (Problem 2).
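
The per-pixel decision and the two summary metrics can be sketched in Python as follows. The reachable set is represented as a set of classes per pixel, the function name and the data are hypothetical, and raising an error when no pixel is attacked is an assumption added here to avoid a division by zero.

```python
def verify_pixels(reach, y, attacked):
    """Classify every pixel as robust, non-robust, or unknown and compute RV and RS.

    reach[i][j]    : set of classes the pixel can take under the attack
    y[i][j]        : class of the pixel without attack
    attacked[i][j] : True if the attack modifies the pixel
    """
    n_robust = n_nonrobust = n_unknown = n_attacked = 0
    for i, row in enumerate(y):
        for j, label in enumerate(row):
            if attacked[i][j]:
                n_attacked += 1
            if reach[i][j] == {label}:
                n_robust += 1          # only the non-attacked class is reachable
            elif label not in reach[i][j]:
                n_nonrobust += 1       # the non-attacked class is unreachable
            else:
                n_unknown += 1         # several classes, including the original one
    if n_attacked == 0:
        raise ValueError("robustness sensitivity is undefined without attacked pixels")
    rv = 100.0 * n_robust / (len(y) * len(y[0]))        # robustness value in percent
    rs = (n_nonrobust + n_unknown) / n_attacked         # robustness sensitivity
    return n_robust, n_nonrobust, n_unknown, rv, rs


# Hypothetical 2 x 2 example with two attacked pixels.
reach = [[{0}, {0, 1}],
         [{1}, {2}]]
y = [[0, 0],
     [1, 1]]
attacked = [[True, True],
            [False, False]]
print(verify_pixels(reach, y, attacked))
```

In this example two pixels are robust, one is non-robust, and one is unknown, giving a robustness value of 50% and a sensitivity of 1.0, i.e., one affected pixel per attacked pixel.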


Algorithm 4.1. Robustness verification of a semantic segmentation network.
Input: f, x, A, RF, method    ▷ SSN, input image, attack, relaxation factor, relaxation method
Output: R_f, RV, RS    ▷ pixel-class reachable set, robustness value, robustness sensitivity
 1: procedure [R_f, RV, RS] = VERIFY(f, x, A, RF, method)
 2:   I = constructInputSet(x, A)    ▷ construct an ImageStar input set
 3:   R_f = Reach(f, I, method)    ▷ compute the pixel-class reachable set
 4:   y = f(x)    ▷ compute non-attacked output segmentation image
 5:   h = x.Height, w = x.Width
 6:   N_robust = 0, N_nonrobust = 0, N_unknown = 0, N_attackedpixels = 0
 7:   for i = 1 : h do
 8:     for j = 1 : w do
 9:       if A.x^noise(i, j) ≠ 0 then N_attackedpixels = N_attackedpixels + 1
10:       if R_f(i, j) = y(i, j) then N_robust = N_robust + 1    ▷ pixel x(i, j) is robust
11:       else
12:         if y(i, j) ∉ R_f(i, j) then N_nonrobust = N_nonrobust + 1    ▷ pixel x(i, j) is non-robust
13:         else N_unknown = N_unknown + 1    ▷ pixel x(i, j) robustness is unknown
14:   RV = (N_robust / (h × w)) × 100%    ▷ robustness value
15:   RS = (N_nonrobust + N_unknown) / N_attackedpixels    ▷ robustness sensitivity
16:   R_IoU = getAverageIoU(y, R_f)    ▷ robust IoU


5 Evaluation


Experimental Setup. The approach is implemented in the NNV software tool for
verification of deep neural networks³. We evaluate our approach by verifying the
robustness of a set of SSNs trained on the MNIST [21] and M2NIST datasets
shown in Table 1, where class “ten” corresponds to the background, and the
other classes to the corresponding digits. The experiments were performed on a
computer with an Intel Core i7-6700 CPU at 3.4 GHz with 8 cores and 64 GB
of memory running Windows 10. The over-approximating reachability method and
6 cores are used for computing the pixel-class reachable sets.

We randomly selected 100 MNIST images (of size 28 × 28) and 100 M2NIST
images (of size 64 × 84) to evaluate the robustness of the trained SSNs. We
attack each image $x$ in these two test sets using a UBAA darkening attack.
In particular, we darken a pixel $x(i, j)$ in the image if its value is larger than
a threshold $d$, i.e., if $x(i, j) > d$, then $x^{adv}(i, j) = a \ll d$. Mathematically, the
adversarial darkening attack on an image $x$ can be described as:


$$x^{adv} = x + \epsilon \cdot x^{noise}, \quad 1 - \Delta_\epsilon \le \epsilon \le 1,$$


$$x^{noise}(i, j) = -x(i, j) \text{ if } x(i, j) > d, \text{ otherwise } x^{noise}(i, j) = 0.$$


For $\epsilon = 1$, we completely darken all the pixels whose values are larger than $d$
($= 150$ in our experiments), i.e., $x^{adv}(i, j) = 0$. The size of the input set caused
by the attack is defined by $\Delta_\epsilon$. Generally, we have a large input set when $\Delta_\epsilon$ is
large. To evaluate the average robustness values (RV) and sensitivities (RS) of
the SSNs (on the test sets) in connection with the number of attacked pixels,
we further restrict the maximum allowable number of attacked pixels by $N_{max}$.
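
A minimal sketch of how such a darkening attack could be constructed is given below. It builds the noise image $x^{noise}$ and the two extreme images of the input set, $\epsilon = 1$ (darkest) and $\epsilon = 1 - \Delta_\epsilon$ (brightest). The function name is hypothetical, and so is the way $N_{max}$ is enforced: the sketch simply attacks the first $N_{max}$ qualifying pixels in row-major order, which the text does not specify.

```python
def darkening_attack(x, d=150, delta_eps=0.01, n_max=None):
    """Return the noise image and the darkest and brightest images of the input set."""
    noise = [[0.0] * len(row) for row in x]
    attacked = 0
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            # Assumption: stop after n_max attacked pixels, scanning row by row.
            if value > d and (n_max is None or attacked < n_max):
                noise[i][j] = -float(value)
                attacked += 1
    eps_low, eps_high = 1.0 - delta_eps, 1.0
    # The noise is non-positive, so the largest epsilon gives the darkest image.
    darkest = [[v + eps_high * n for v, n in zip(rx, rn)] for rx, rn in zip(x, noise)]
    brightest = [[v + eps_low * n for v, n in zip(rx, rn)] for rx, rn in zip(x, noise)]
    return noise, darkest, brightest


# Hypothetical 2 x 2 grayscale image.
image = [[200, 100],
         [160, 255]]
noise, darkest, brightest = darkening_attack(image, d=150, delta_eps=0.01)
print(noise)
print(darkest)
print(brightest)
```

Each attacked pixel therefore ranges over $[0, \Delta_\epsilon \cdot x(i, j)]$, while pixels at or below the threshold keep their original value.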

We focus our evaluation and discussion on three aspects: 1) the robustness 
and sensitivity of different SSN architectures under adversarial attacks, 2) the 
effect of SSN architectures and input size on verification performance, and 3) 
the improvement of the new relaxed reachability method in terms of verification 
results and performance. For the first two aspects, we use the relaxed reachability 
method with relaxation factor RF = 0%, i.e., no relaxation, to construct the 
reachable sets of the SSNs. 


3 The examples and tool are available at https://github.com/verivital/nnv/tree/cav2021/code/nnv/examples/Submission/CAV2021. An archival version is available
on Zenodo: https://doi.org/10.5281/zenodo.4726346.


Table 1. Semantic Segmentation Network Benchmarks. Notation: ‘I’: input, ‘C’: convolution,
‘TC’: transposed convolution, ‘DC’: dilated convolution, ‘R’: ReLU, ‘B’: batch
normalization, ‘AP’: average-pooling, ‘MP’: max-pooling, ‘S’: softmax, ‘L’: label (pixel
classification).

ID | Name | Accuracy (IoU) | Down-sampling | Up-sampling | Input size | Layers
N1 | mnist_ap_tc | 0.87 | C+AP | TC | 28 × 28 | 21 (1I, 7C, 3R, 4B, 2AP, 2TC, 1S, 1L)
N2 | mnist_mp_tc | 0.85 | C+MP | TC | 28 × 28 | 21 (1I, 7C, 3R, 4B, 2MP, 2TC, 1S, 1L)
N3 | mnist_dc | 0.83 | C | DC | 28 × 28 | 21 (1I, 3C, 3R, 3B, 9DC, 1S, 1L)
N4 | m2nist_ap_dc | 0.62 | C+AP | DC | 64 × 84 | 16 (1I, 4C, 3R, 3AP, 3DC, 1S, 1L)
N5 | m2nist_ap_tc | 0.75 | C+AP | TC | 64 × 84 | 22 (1I, 7C, 8R, 2AP, 2TC, 1S, 1L)
N6 | m2nist_dc | 0.72 | C | DC | 64 × 84 | 24 (1I, 1C, 5R, 5B, 10DC, 1S, 1L)


Fig. 4. The average robustness value, sensitivity, and IoU of MNIST SSNs.


5.1 Robustness and Sensitivity of Different Network Architectures 


Max-Pooling vs. Average-Pooling. Max-pooling is the preferred choice over
average-pooling for training SSNs because of its nonlinear characteristics. We
investigate whether max-pooling is actually better than average-pooling in terms
of accuracy and robustness of deep SSNs. Figure 4 illustrates the average
robustness and sensitivities of MNIST SSNs under different numbers of attacked
pixels (Fig. 4a, 20 images are used) and input sizes (Fig. 4b, 10 images are used).
We focus on the first two SSNs, i.e., $N_1$ and $N_2$. These SSNs have the same
architecture (with 21 layers). The only difference is that $N_1$ uses average-pooling for
down-sampling while $N_2$ uses max-pooling for the same task (both SSNs use two
transposed convolutional layers for up-sampling). After training, we observed
that $N_1$ is more accurate than $N_2$ (0.87 IoU vs. 0.85 IoU, see Table 1).
Interestingly, $N_1$ is also more robust than $N_2$ since it has a larger average robustness
value (Figs. 4a-a, 4b-a), a higher average robust IoU (Figs. 4a-c, 4b-c), and more
robust pixels (Figs. 4a-d, 4b-d). One can also see that the average-pooling-based
SSN is less sensitive to the attack than the max-pooling-based SSN (Figs. 4a-(b,
e, f), 4b-(b, e, f)). Notably, when more pixels are attacked or larger input sizes are
used, the max-pooling-based SSN (i.e., $N_2$) produces more pixels with unknown
robustness (Figs. 4a-f, 4b-f, and 5). Lastly, when the input size increases, the
robustness of the max-pooling-based SSN drops more quickly than that of the
average-pooling-based SSN (Fig. 4b-(a, d)) and its sensitivity increases faster (Fig. 4b-b).
We believe the main reason the max-pooling-based SSN is more
sensitive to the attack is the high nonlinearity introduced by its max-pooling layers.
Interestingly, even though the max-pooling-based SSN $N_2$ has a higher
accuracy (0.85) than the non-max-pooling SSN $N_3$ (0.83), the average robust IoU of
$N_2$ is smaller than that of $N_3$ (Figs. 4a-c, 4b-c).


Fig. 5. Example pixel-class reachable sets of MNIST SSNs. The max-pooling-based
SSN $N_2$ produces more unknown pixels than the average-pooling-based SSN $N_1$ (19
vs. 6). Panel (b): $R_f(N_2)$, $N_{unknown} = 19$ ($N_{max} = 50$, $\Delta_\epsilon = 0.003$).


Accuracy vs. Robustness; Deeper Networks and ReLU Layer Robustness.
Accuracy (and for segmentation, IoU) is one of the most important factors
for evaluating deep neural networks. We investigate whether more accurate and
deeper SSNs are more robust compared to other architectures. To determine
this, we analyze the robustness of two SSNs with different architectures and
accuracy trained on the M2NIST data set. The first SSN, $N_4$, is based on dilated
convolution with 16 layers and 0.62 (IoU) accuracy (Table 1). The second SSN,
$N_5$, is based on transposed convolution with 22 layers and 0.75 (IoU) accuracy.
Here, the second SSN is deeper and more accurate than the first SSN. We run the
robustness analysis on these two SSNs on a set of 20 M2NIST images. The results
are depicted in Fig. 6. In terms of robustness, the more accurate and deeper SSN
$N_5$ is worse than the less accurate one, $N_4$, as it has a smaller average robustness
value and IoU (Figs. 6-(a,c), 7). Additionally, $N_5$ is also more sensitive to the
attack than $N_4$ (Fig. 6-(b,e)) when we increase the number of attacked pixels.
The main reason for this result is that the more accurate SSN contains many ReLU
layers (8 ReLU layers) compared with the less accurate one (3 ReLU layers).
Similar to the max-pooling layer, using many ReLU layers increases the
nonlinearity of the SSN to capture complex features of images. Unfortunately, it also
makes the SSN more sensitive to the attack.
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Fig. 6. The average robustness value, sensitivities, IoU, verification time (A, = 107°) 
and reachability times (blue for ReLU layers and orange for others, Ae = 6 x 107°) of 
M2NIST SSNs. (Color figure online) 
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Fig. 7. Example pixel-class reachable sets of M2NIST SSNs. The more accurate and 


deeper SSN $N_5$ produces more non-robust pixels than the less accurate SSN $N_4$ (51 vs.
43).


Dilated Convolution vs. Transposed Convolution. Dilated convolution
and transposed convolution are typical choices for semantic segmentation tasks.
We compare these techniques in terms of accuracy and robustness. On MNIST
SSNs, although the transposed-convolution SSN $N_1$ and the dilated-convolution
SSN $N_3$ have the same number of layers (21 layers with 3 ReLU), $N_3$ is less
accurate than $N_1$ (0.83 vs. 0.87 IoU, see Table 1). In terms of robustness, $N_3$
is also less robust and more sensitive to the attack than $N_1$, as it has smaller
average RV and IoU, and larger sensitivities (Fig. 4). On M2NIST SSNs, by
considering the 22-layer (8 ReLU) transposed-convolution SSN $N_5$ and the 24-layer
(4 ReLU) dilated-convolution SSN $N_6$, one can see that even with more
layers, $N_6$ is less accurate than $N_5$ (0.72 vs. 0.75 IoU, see Table 1). Also, $N_6$ is less
robust and more sensitive to the attack than $N_5$, since it has smaller average RV
and IoU, and larger sensitivities (Fig. 6).


5.2 Verification Performance


Dilated Convolution vs. Transposed Convolution. In general, more
attacked pixels and a larger input size lead to greater verification time, as depicted
in Figs. 8a, 8b and 6b-(a). Interestingly, these show that the
dilated-convolution-based SSNs require greater verification time than the ones using transposed
convolution. For example, the verification time of $N_3$ is larger than that of $N_2$ when they
have the same number of layers.


Fig. 8. Verification time is proportional to the number of attacked pixels and input
size. The max-pooling-based $N_2$ and dilated convolution-based $N_3$ SSNs require more
verification time than the average-pooling and transposed convolution-based SSN $N_1$.
The reachability times of ReLU layers (blue) dominate the total reachability time
(the reachability times of other layers are in orange). (Color figure online)


Max-Pooling and ReLU Layers. Using max-pooling layers for down-sampling
not only decreases the robustness of an SSN but also causes a dramatic increase
in time and memory consumption in verification. Figure 8 shows that the
verification time (in seconds) of the max-pooling-based SSN $N_2$ grows significantly
compared with the average-pooling-based SSN $N_1$ when increasing the number
of attacked pixels $N_{attackedpixels}$ or the input size $\Delta_\epsilon$. When dealing with a larger
number of attacked pixels or a larger input size, the max-pooling layer introduces
more predicate variables to overapproximate the reachable set, which
increases both computation time and memory usage [33]. Similar to the
max-pooling layer, the ReLU layer is also a main source of robustness degradation.
Additionally, it may also dominate the reachability time of an SSN, as shown in
Fig. 8c. This leads to an increase in the verification time for SSNs with many
ReLU layers.


5.3 Reducing Verification Time with Relaxation 


When ReLU layer analysis dominates the total verification time significantly, as
in the case of MNIST SSNs shown in Fig. 8c and not in the case of M2NIST
SSNs depicted in Fig. 6b-(b), we can use the relaxed ImageStar reachability
methods to speed up the verification process. Table 2 presents the percentage
decrease in verification time when applying different relaxation heuristics
for ReLU layers. We note that due to the small input size and the small number of
attacked pixels, we do not see any changes in the robustness value, sensitivity,
and IoU compared with the non-relaxation method, i.e., the original approximate
ImageStar method. However, there is a significant improvement in verification
time when we apply the relaxed ImageStar reachability for the non-max-pooling
SSNs $N_1$ and $N_3$. More relaxation leads to a higher reduction in the verification
time: up to 99% of the verification time can be eliminated with 100% relaxation in
the reachability of ReLU layers.

Interestingly, using relaxation for the max-pooling-based SSN $N_2$ decreases
the verification performance, i.e., it leads to higher verification time. The main
reason is that the relaxed reachable sets after ReLU layers become increasingly
conservative. At the max-pooling layer, a more conservative reachable set leads
to more local max-point candidates that need to be determined by solving more
LPs, which causes an increase in the verification time. Additionally, if a local
region has more than one max-point candidate, a new predicate variable and its
corresponding generator image are introduced [33]. The increase in the number
of predicate variables and generator images causes an explosion in the memory
usage for the analysis. In the worst case, it can lead to a memory error, as shown in
Table 2. Therefore, it is important to have relaxation strategies for max-pooling
layers, which will be investigated in our future work.


Table 2. The relaxed ImageStar reachability methods can significantly reduce the
verification time (in seconds) of MNIST SSNs except for the one containing
max-pooling layers, i.e., $N_2$. The maximum allowable number of attacked pixels is
$N_{max} = 50$ for $N_1$ and $N_2$ and $N_{max} = 20$ for $N_3$.


ID | RF Ae &
Rand Rand a Rand Area Bound
0.00) 20.56 82.57 861.69 862.03
0.25] 19 72.2(| 13 ) 978.1(| -13%)
N1 | 0.50] 17.7(1 )  779.5(| 10%)
0.75 | 17.0(, 16.7(| 14% 347.6(| 60%) - 389.2(| )  439.0( 49%)
1.00 16.2(| 17%) 90.5(1 89%) 90.1(| 90%) 94.4( 87%) _92.9(| 89%)
0.00 MemErr MemErr MemErr
0.25 378.1( MemErr MemErr MemErr
N2 | 0.50 62.9(| -4: 481.0(| - MemErr MemErr MemErr MemErr
0.75 | 72.9(| -62 l 5( -6i 306.0(| -9 l ) 448.5(] - MemErr MemErr MemErr MemErr
1.00 |79.6(| -76%) 79.5(| -79%) 79.8(| -83%) 79.9(| -77%) |364.4(| -30%) 325.4(| -14%) 322.1(| -27%) 318.5(| - MemErr MemErr MemErr MemErr
0.00| 119.63 118.74 120.66 1119.16 1116.85 996.56 1116.66 17699.81 17651.30 17260.00 17780.00
0.25 100.4(} 15%) 95.3 %) | 920.7(1 18%) 1020.7(1 9%) 874.9(} 12%) 1157.0(} -4%) | 15474.4(| 13%) 17222.3(} 2%) 14700.0(| 15%) 17201.0(1 3%)
N3 | 0.50 648.4(| 42%) - 759.4(| 32% T 797.8(| 29%) |11976.3 ) 14566.7(| 17%) 11902.0(| 14729.0(| 17%)
0.75| 45 352.6(| 68%)  424.9(| 416.8(| 63%) | 6720.0(| 62%)  8556.8( 7217.0(| 5%)
1.00 22.9(| 80%) 22.3(| 81%) | 47.6(1 96%)  45.7(} 96%)  45.7(1 95%)  45.2(1 96%) | 116.1(1 99%)  115.7(1 99%) _115.4(| 99%)  115.2(4 99%)


Fig. 9. The conservativeness of different relaxation heuristics. The area-based and
range-based relaxation strategies outperform others in terms of conservativeness.


5.4 Conservativeness of Different Relaxation Heuristics 


We have four relaxation heuristics that can be used in the reachability analysis
of ReLU layers. The verification time improvement of these methods is quite
similar, as shown in Table 2. It is interesting to see how they compare in terms
of conservativeness. Unfortunately, we cannot see this clearly via verification of
SSNs. Although increasing the number of attacked pixels and the input size can
eventually show the difference in conservativeness of these methods, it requires
a more powerful computer with massive memory for verification. Therefore, to
determine the best relaxation heuristic in terms of conservativeness, we evaluate
image classification robustness, which has been studied extensively recently and
illustrates the benefits of the relaxation method beyond SSN verification. We
apply our four relaxation heuristics to verify the robustness of an MNIST
classification network [29] that is trained by the DiffAI robust training framework under
the $L_\infty$-norm attack, where all pixels of an input image are attacked
independently by a bounded disturbance defined by $\epsilon$⁴. The robustness of the network
is quantified as the percentage of 100 randomly selected
images that are provably robust under the attack, i.e., classified correctly.


Fig. 10. When the relaxation factor (RF) ≤ 0.5, the area-based relaxed reachability
is less conservative than DeepZono [28] and DeepPoly [29]. It is also faster than these
approaches when the disturbance is small, i.e., $\epsilon < 0.11$.


Figure 9 illustrates the conservativeness of different relaxation methods. One
can see that the area-based and range-based relaxation strategies consistently
outperform the others in terms of conservativeness since their provable numbers of
robust images (out of 100 images) under different sizes of the $L_\infty$-norm attack
are higher than the others in all cases. Figure 10 illustrates the conservativeness and
verification time of our area-based relaxed reachability (with different relaxation
factors (RF)) in comparison with DeepZono [28] and DeepPoly [29]. In terms
of conservativeness, the area-based relaxed reachability is better than DeepZono
and DeepPoly when we choose a relaxation factor $RF \le 0.5$. When the
disturbance is large, DeepZono and DeepPoly may become very conservative. For
example, when the disturbance bound is $\epsilon = 0.2$, only 5 and 14 (out of 100)
images are proved robust by DeepZono and DeepPoly, respectively. Meanwhile,
without relaxation, i.e., relaxation factor $RF = 0$, the area-based relaxed
reachability can prove that 54 images are robust under the attack. It can prove robustness
of 48 and 23 images when the relaxation factors are 0.25 and 0.5, respectively. In
terms of verification time, when the disturbance is small, i.e., $\epsilon < 0.11$, the
area-based relaxed reachability is faster than DeepZono and DeepPoly. It is slower
than DeepPoly for larger disturbances (except for the case when the relaxation
factor is 1). This increase in the verification time is expected since DeepZono and
DeepPoly do not solve any LPs for constructing the overapproximate reachable
set of the network while our approach does. Because they use only estimated ranges
of the neurons in constructing the reachable set, DeepZono and DeepPoly are
overly conservative for a large disturbance, proving only a few images robust.
This reflects the fact that more computation time for optimization is needed to
prove more images robust.


4 These benchmarks were used in VNN-COMP’20.


6 Related Work 


To enable the use of neural networks in safety-critical scenarios, many methods
have recently been proposed to improve their robustness and temper their
susceptibility to adversarial attacks. The following section surveys the
landscape of these approaches in order to better contextualize our work.


SSN Robustness. SSNs are used in visual understanding systems in numerous
contexts, and recent works aim to improve the robustness of these models
[13,20,23,25], although none provide worst-case guarantees, as our approach
does. For instance, recent work develops rigorous testing-based approaches
to evaluate the robustness of SSNs, considering a wide range of architectures, 
and offering an insightful discussion about the comparative robustness of these 
modalities against various adversarial attacks [2]. Kamann et al. conducted an 
extensive evaluation of a state-of-the-art SSN using over 400,000 images and 
issued a series of recommendations aimed at improving robustness to common 
perturbations. Zhou et al. presented an automated method for evaluating the
robustness of SSNs within visual systems for autonomous vehicles, which
leverages an additional sensor to generate ground truth labels so that the
classification accuracy of an SSN can be evaluated at runtime [47]. Robust
training techniques that incorporate image corruptions and architecture
modalities have also been developed for SSNs [20]. Even though such works provide
better understanding, potential defenses against adversarial perturbations,
run-time evaluation, and comparative robustness measures, they cannot provide
formal verification guarantees for SSN robustness as our work does.


Neural Network Verification and Falsification. The bulk of neural network
verification approaches have been aimed at verifying input-output properties of
DNNs. These methods include SMT [18,19], polyhedral [35,44], mixed integer linear
programming (MILP) [9], interval arithmetic [38], zonotope [28], linearization
[39], and abstract-domain [29] approaches. There have also been a number of works
aimed at testing the robustness of networks with respect to bounded input per- 
turbations such as feature-guided search, global optimization, and game theory 
[16,42]. One such example is the work of Dreossi et al. where the authors pro- 
posed a general definition of robustness for DNNs [8]. Their work categorizes 
the existing literature into approaches that consider local robustness properties 
[6], and those that focus on verifying the global robustness of the networks [14]. 
Most of the existing research in this area focuses on robustness of classification 
neural networks, specifically image classification. While many approaches aim at 
verification, methods also exist for falsification of system specifications, in which 
robustness properties are included [12]. However, to the best of our knowledge, 
no existing approaches consider verification for SSNs, as we do in this paper. 


Sequence Model Verification and Robustness Analysis. Aside from classifica- 
tion tasks, there are several verification approaches for sequence models. Unlike 
SSN and classification networks, the output of sequence models such as recur- 
rent neural networks (RNNs) depends on spatially or temporally ordered data 
[4,41]. While some of these efforts are similar in spirit to our work in expanding 
the classes of problems and models for verification, the verification tasks and 
approaches differ. 


Scalability and Specifications. Finally, verification of DNNs is challenging, and 
presently the most complex networks remain inaccessible to the majority of 
methods. However, several recent approaches have focused on improving the effi- 
ciency of existing methods via parallelization and other techniques [3,35, 40]. As 
verification work is only meaningful when paired with high-quality specifications, 
there has been significant work on the importance of semantics when defining 
system specifications against adversarial attacks [27], and our paper contributes 
to this direction through our formulation of robustness specifications and metrics 
for segmentation tasks. 


7 Conclusion 


We present the first formal approach to verify robustness of SSNs using relaxed 
reachability analysis. Our evaluation has analyzed the robustness and sensitivity 
under adversarial attacks on a set of SSNs with typical architectures. From our 
experiments, we show that while max-pooling and ReLU layers are useful in 
training highly accurate SSNs, they are also the main sources of robustness 
and verification performance degradation. SSNs using average-pooling for down- 
sampling and transposed convolution for up-sampling seem to be an optimal 
choice for achieving high accuracy, robustness, and verification performance. 
Additionally, our relaxed reachability approach can significantly reduce
the total verification time for networks where the reachability time of ReLU
layers dominates the network’s reachability time, and it is applicable to other
networks, such as CNNs used for classification. In the future, we will investigate
new relaxation heuristics for the max-pooling layer and extend this work to 
cope with the encoder-decoder SSN architecture where max-unpooling layers 
are used for up-sampling operations, instead of dilated/transposed convolution 
as we considered in this paper. 


Abstract. Neural Networks (NNs) have increasingly apparent safety
implications commensurate with their proliferation in real-world appli- 
cations: both unanticipated as well as adversarial misclassifications can 
result in fatal outcomes. As a consequence, techniques of formal verifi- 
cation have been recognized as crucial to the design and deployment of 
safe NNs. In this paper, we introduce a new approach to formally verify 
the most commonly considered safety specifications for ReLU NNs — i.e. 
polytopic specifications on the input and output of the network. Like 
some other approaches, ours uses a relaxed convex program to mitigate 
the combinatorial complexity of the problem. However, unique in our 
approach is the way we use a convex solver not only as a linear feasibil- 
ity checker, but also as a means of penalizing the amount of relaxation 
allowed in solutions. In particular, we encode each ReLU by means of the 
usual linear constraints, and combine this with a convex objective func- 
tion that penalizes the discrepancy between the output of each neuron 
and its relaxation. This convex function is further structured to force the 
largest relaxations to appear closest to the input layer; this provides the 
further benefit that the most “problematic” neurons are conditioned as 
early as possible, when conditioning layer by layer. This paradigm can 
be leveraged to create a verification algorithm that is not only faster in 
general than competing approaches, but is also able to verify consider- 
ably more safety properties; we evaluated PEREGRiNN on a standard 
MNIST robustness verification suite to substantiate these claims. 


Keywords: Machine learning/AI · Decision procedures and solvers


1 Introduction 


Neural Networks have become an increasingly central component of modern 
machine learning systems, including those that are used in safety-critical cyber- 
physical systems such as autonomous vehicles. The rate of this adoption has 
exceeded the ability to reliably verify the safe and correct functioning of these 
components, especially when they are integrated with other components such as 
controllers. Thus, there is an increasing need to verify that NNs reliably produce
safe outputs, especially subject to malicious adversarial inputs [16,20,27,28].

In this paper, we propose PEREGRiNN, an algorithm for efficiently and formally
verifying the input/output behavior of ReLU NNs. In this context, PEREGRiNN
falls into the broad category of sound and complete search and optimization NN
verifiers [22]. The search aspect of PEREGRiNN involves iterating over
different combinations of neuron activation patterns to verify that each is compat- 
ible with the specified safety constraints (on the input and output of the network). 
Like other algorithms in this category, PEREGRiNN combines this search with 
optimization techniques to make inferences about the feasibility of full-network 
activation patterns on the basis of activation patterns of only a subset of neurons. 
The optimization in question reformulates the original NN feasibility problem into 
a relaxed convex feasibility problem to allow sound inferences: i.e. if the convex 
relaxation is infeasible, then the original NN problem may soundly be concluded 
to be infeasible. In this relaxed feasibility problem, the output of each individual 
neuron is assigned a relaxation variable that is decoupled from the actual output of 
that neuron. PEREGRiNN also uses a type of reachability analysis (symbolic
interval analysis) both to enhance the optimization-based inference described
above and as a source of additional sound inference itself. For this reason,
PEREGRiNN’s search procedure searches neurons in a layer-by-layer fashion,
preferring to fix the phases of neurons closest to the input layer first.

In contrast to other search and optimization algorithms, however, PERE- 
GRiNN augments each convex feasibility query with a (convex) penalty function in 
order to obtain better guidance on which activation patterns to search next. In par- 
ticular, we note that the amount of relaxation needed on a neuron can be regarded 
as a quasi-measure of how close the convex solver came to operating the associated 
neuron in a valid regime — i.e. at a valid evaluation of that neuron on a particu- 
lar input. In this sense, the amount of relaxation in aggregate can be regarded as 
a quasi-measure of how close the solver came to finding a valid evaluation of the 
network as a whole. Inversely, the largest distance between a relaxation variable 
and its neuron’s closest ReLU constraint intuitively corresponds in some sense to 
how “problematic” that neuron is with regard to obtaining such a valid evaluation. 
These distances we refer to as the “slacks” for each neuron. Thus, PEREGRiNN 
may be regarded as greedily minimizing a slack-based penalty. 

Finally, we evaluated the performance of PEREGRiNN by using it to verify 
the adversarial robustness of networks trained on the MNIST [21] dataset. Our 
experiments show that PEREGRiNN is on average 1.27× faster than Neurify [31],
1.24× faster than Venus [6], 1.15× faster than nnenum [4], and 1.65× faster than
Marabou [19]. It also proves 27%, 19%, 10%, and 51% more properties than the
other solvers, respectively. PEREGRiNN’s unique convex penalty augmentations 
are also considered in ablation experiments to validate their benefits. 


Related Work. Since PEREGRiNN is a sound and complete verification algorithm,
we restrict our comparison to other sound and complete algorithms. NN verifiers
can be grouped into roughly four categories: (i) SMT-based methods, which
encode the problem into a Satisfiability Modulo Theory problem [11,18,19];
(ii) MILP-based solvers, which directly encode the verification problem as a
Mixed Integer Linear Program [3,5–8,14,23,29]; (iii) reachability-based methods,
which perform layer-by-layer reachability analysis to compute the reachable set
[4,13,15,17,30,32,34,35]; and (iv) convex relaxation methods [10,31,33]. In
general, (i), (ii) and (iii) suffer from poor scalability. On the other hand,
convex relaxation methods depend heavily on pruning the search space of
indeterminate neuron activations; thus, they generally depend on obtaining good
approximate bounds for each of the neurons in order to reduce the search space
(the exact bounds are computationally intensive to compute [9]). These methods
are most similar to PEREGRiNN: for example, [7,25,32] recursively refine the
problem using input splitting, and [31] does so via neuron splitting. Other
search and optimization methods include: Planet [11], which combines a relaxed
convex optimization problem with a SAT solver to search over neurons’ phases;
and Marabou [19], which uses a modified simplex algorithm.


Fig. 1. Block diagram of the PEREGRiNN algorithm


2 Problem Formulation 


In this paper, we will consider Rectified Linear Unit (ReLU) NNs. An $n$-layer
ReLU network is a composition of $n$ ReLU layer functions: i.e.
$\mathcal{NN} = f_n \circ f_{n-1} \circ \cdots \circ f_1$, where the $i$th ReLU
layer function is defined as
$f_i : y \in \mathbb{R}^{k_{i-1}} \mapsto \max\{W_i y + b_i, 0\} \in \mathbb{R}^{k_i}$.
We refer to $f_1$ as the input layer. Finally, to refer to individual neurons,
we use the notation $(z)_j$ to indicate the $j$th element of $z$.
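
As a small illustration of this definition, the following sketch evaluates a ReLU network as a composition of layer functions. The representation of the network as a list of `(W, b)` pairs, the function names, and the example weights are assumptions made for this sketch only.

```python
import numpy as np

def relu_layer(W, b, y):
    """One ReLU layer: y -> max(W y + b, 0), taken elementwise."""
    return np.maximum(W @ y + b, 0.0)

def evaluate_network(layers, x):
    """Apply the layers in order, input layer first.

    `layers` is a list of (W_i, b_i) pairs; this list representation is an
    assumption of the sketch, not something fixed by the text.
    """
    y = np.asarray(x, dtype=float)
    for W, b in layers:
        y = relu_layer(W, b, y)
    return y

# Hand-picked 2-layer example: 2 inputs -> 2 neurons -> 1 output.
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, -1.0])),
    (np.array([[1.0, 1.0]]), np.array([0.5])),
]
output = evaluate_network(layers, [2.0, 1.0])
```

Note that, following the definition above, the last layer also applies the ReLU.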


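Verification Problem. Let $\mathcal{NN}$ be an $n$-layer NN as defined above.
Furthermore, let $P_{y_0} \subset \mathbb{R}^{k_0}$ be a convex polytope in the
input space of $\mathcal{NN}$, and let $P_{y_n} \subset \mathbb{R}^{k_n}$ be a
convex polytope in the output space of $\mathcal{NN}$. Finally, let
$h_\ell : \mathbb{R}^{k_0} \times \mathbb{R}^{k_n} \to \mathbb{R}$,
$\ell = 1, \dots, m$, be convex functions defining joint input/output
constraints on $\mathcal{NN}$. Then the verification problem is to decide whether

$$
\Big\{ x \in \mathbb{R}^{k_0} \;\Big|\; x \in P_{y_0} \wedge \mathcal{NN}(x) \in P_{y_n} \wedge \Big( \bigwedge_{\ell=1}^{m} h_\ell(x, \mathcal{NN}(x)) \le 0 \Big) \Big\} = \emptyset. \tag{1}
$$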


3 PEREGRINN Overview 


The general structure of PEREGRiNN is depicted in Fig. 1. Like other search
and optimization based NN verifiers it has two main components: a search
component and an inference component, and PEREGRiNN iterates back and forth
between these two components until termination. In particular, the search
and inference components interact in the following way. The search component 
successively iterates over all possible on/off activations for each neuron; this is 
done by fixing these activations one neuron at a time, starting from the input 
layer and working towards the output layer. The process of fixing a neuron’s acti- 
vation is referred to as conditioning its phase: each neuron can be in either its 
active phase (operating linearly) or inactive phase (outputting zero). Thus, the 
search component provides the inference component a subset of neurons, each of 
which has been conditioned; the inference component then attempts to soundly 
reason about whether the remaining, unconditioned neurons can be operated in 
such a way as to violate the safety constraint. If the inference component soundly 
concludes safety for all possible activations of the remaining unconditioned neu- 
rons, then the search component backtracks, oppositely reconditioning one of 
the neurons that was already conditioned. Otherwise, if a sound safe conclusion 
is not made, then the search component uses information from the inference 
component to decide on a new neuron to condition, and the process repeats. 
The algorithm terminates if either a counterexample to safety is found, or else 
all possible neuron activations are considered without finding such a counterex- 
ample. 
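
The interaction between the two components can be sketched as a depth-first search with pruning. This is a minimal sketch under stated assumptions: `infer` is a hypothetical callback standing in for the whole inference component, neurons are identified by a flat index ordered from the input layer outwards, and the next neuron is taken in index order rather than by the prioritization described later.

```python
def search(num_neurons, infer):
    """Depth-first search over ON/OFF phases, one neuron at a time.

    `infer(fixed)` is a hypothetical inference callback. `fixed` maps a
    neuron index to True (active) or False (inactive). It must return
    "safe" if no completion of `fixed` can violate the property,
    ("cex", data) if a counterexample was found, or "unknown".
    """
    stack = [{}]  # partial phase assignments still to be explored
    while stack:
        fixed = stack.pop()
        result = infer(fixed)
        if result == "safe":
            continue  # prune this branch and backtrack to a sibling
        if isinstance(result, tuple):
            return result  # counterexample found
        if len(fixed) == num_neurons:
            raise ValueError("infer must be decisive on full assignments")
        nxt = len(fixed)  # next unconditioned neuron, input layer first
        stack.append({**fixed, nxt: False})
        stack.append({**fixed, nxt: True})
    return "safe"  # every branch was pruned

def toy_infer(fixed):
    """Toy stand-in: only the pattern {0: ON, 1: OFF} is unsafe."""
    if len(fixed) < 2:
        return "unknown"
    if fixed == {0: True, 1: False}:
        return ("cex", dict(fixed))
    return "safe"

found = search(2, toy_infer)
all_safe = search(2, lambda fixed: "safe")
```

Pushing both phases of a neuron onto the stack plays the role of backtracking: when one phase is concluded safe, the opposite phase is the next branch to be examined.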

The convex program inference block is at the heart of the inference component
and PEREGRiNN itself. In this block, PEREGRiNN, like other search and
optimization solvers, uses a relaxed linear feasibility program where the output
of each individual neuron is assigned a relaxation variable that is decoupled from
the actual output of that neuron. In the notation of Sect. 2, such a linear
feasibility program can be written as follows, where the vector variables $y_i$,
$i \neq 0$, are the relaxation variables.


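$$
\begin{aligned}
& y_i \ge 0, \quad y_i \ge W_i y_{i-1} + b_i \quad \forall i = 1, \dots, n \\
& y_0 \in P_{y_0}, \quad y_n \in P_{y_n}, \quad \bigwedge_{\ell=1}^{m} h_\ell(y_0, y_n) \le 0
\end{aligned}
\tag{2}
$$
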
Importantly, if (2) is infeasible, then the original NN problem in (1) may 
be soundly concluded to be infeasible as well — and hence, safe. However, as 
described above, the primary function of the convex feasibility program is to 
use a set of conditioned neurons supplied by the search component in order to 
soundly reason about the remaining neurons. To do this, the conditioned neurons 
supplied by the search component are incorporated into the feasibility program 
(2) as equality constraints in the following way: 


Neuron $(y_i)_j$ ON: $(y_i)_j = (W_i y_{i-1} + b_i)_j \wedge (y_i)_j \ge 0$ (3)
Neuron $(y_i)_j$ OFF: $(y_i)_j = 0 \wedge (W_i y_{i-1} + b_i)_j \le 0$. (4)


Inferences created by the symbolic interval inference block using Symbolic Inter- 
val Analysis [32] are also incorporated using equality constraints like (3) and (4). 
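
To make the relaxed program and the conditioning constraints concrete, here is a minimal sketch that builds the linear feasibility program and hands it to SciPy's `linprog`. Several simplifications are assumptions of this sketch: the input polytope is a box, the output polytope is given as `A_out @ y_n <= c_out`, the joint constraints are omitted, and SciPy's HiGHS backend stands in for the solver used in the paper. The function name and argument layout are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def relaxed_feasible(layers, in_lo, in_hi, A_out, c_out, fixed=None):
    """Feasibility of the relaxed program with optional conditioning.

    layers       : list of (W_i, b_i), input layer first
    in_lo, in_hi : box bounds on y_0 (a simple special case of a polytope)
    A_out, c_out : output constraint A_out @ y_n <= c_out
    fixed        : dict {(i, j): bool}, layer i in 1..n and neuron j;
                   True conditions the neuron ON, False conditions it OFF
    """
    fixed = fixed or {}
    sizes = [len(in_lo)] + [W.shape[0] for W, _ in layers]
    offs = np.concatenate(([0], np.cumsum(sizes)))  # start index of each y_i
    nvar = int(offs[-1])
    bounds = list(zip(in_lo, in_hi)) + [(0.0, None)] * (nvar - sizes[0])

    A_ub, b_ub, A_eq, b_eq = [], [], [], []
    for i, (W, b) in enumerate(layers, start=1):
        for j in range(W.shape[0]):
            row = np.zeros(nvar)
            row[offs[i - 1]:offs[i]] = W[j]  # (W_i y_{i-1})_j
            row[offs[i] + j] = -1.0          # -(y_i)_j
            phase = fixed.get((i, j))
            if phase is True:                # ON: equality with linear response
                A_eq.append(row)
                b_eq.append(-b[j])
            else:                            # relaxed: y >= linear response
                A_ub.append(row)
                b_ub.append(-b[j])
                if phase is False:           # OFF: additionally pin y to zero
                    bounds[offs[i] + j] = (0.0, 0.0)
    for a, c in zip(A_out, c_out):           # rows of the output polytope
        row = np.zeros(nvar)
        row[offs[-2]:] = a
        A_ub.append(row)
        b_ub.append(c)

    res = linprog(np.zeros(nvar), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq) if A_eq else None,
                  b_eq=np.array(b_eq) if b_eq else None,
                  bounds=bounds, method="highs")
    return res.status == 0  # status 0: a feasible (optimal) point was found

# One neuron y_1 = ReLU(x) with x in [-1, 1]; the unsafe outputs are y_1 >= 2.
layers = [(np.array([[1.0]]), np.array([0.0]))]
A_out, c_out = np.array([[-1.0]]), np.array([-2.0])
loose = relaxed_feasible(layers, [-1.0], [1.0], A_out, c_out)
on = relaxed_feasible(layers, [-1.0], [1.0], A_out, c_out, {(1, 0): True})
off = relaxed_feasible(layers, [-1.0], [1.0], A_out, c_out, {(1, 0): False})
```

In this toy instance the unconditioned relaxation is feasible, because the relaxation variable has no upper bound, while both conditioned programs are infeasible. The property is therefore safe, but only the search over phases reveals it.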

Of the remaining blocks, the “Backtracking & Reconditioning” block is essen- 
tially described above. The “Condition New Neuron” and “Sampling Inference” 
blocks have features unique to PEREGRiNN that are described in Sect. 4; the 


PEREGRINN: Penalized-Relaxation Greedy Neural Network Verifier 291 


former implements a novel neuron prioritization, and the latter is a unique app- 
roach to quickly obtaining initial safety counterexamples. 


4 PEREGRINN Enhancements 


4.1 Sum-of-Slacks Penalty 


The core enhancement in PEREGRiNN is the inclusion of a specific objective
function in the convex program used by the inference component. As per the
discussion above, this objective function is interpreted as a penalty on how far
away a particular solution is from a valid input/output response of the network
(and activation pattern on all hidden neurons). Specifically, this penalty function
penalizes the sum of all of the “slack” variables for the entire network, where each
neuron’s slack variable is defined as $s_i \triangleq y_i - (W_i y_{i-1} + b_i)$.
That is, the slack is the distance between a relaxation variable $y_i$ and the
linear response of its associated neuron. During each feasibility/inference call,
this has the obvious effect of incentivizing the convex solver to choose an actual
input/output response of the network.

In addition, this penalty is effectively the $L_1$-norm of the vector of all the
slack variables, since the slack variables are non-negative. The $L_1$-norm of a
vector, used as a penalty function, is well known to effectively encourage
sparsity in the resulting optimal solution. Thus, the sum-of-slacks effectively
incentivizes the convex solver to leave as few neurons as possible indeterminate
in the solution. That is, a sum-of-slacks penalty effectively encourages the
convex solver to fix the phases of as many neurons as possible.
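
The slack definition and the unweighted penalty can be written out directly. The sketch below assumes a candidate solution of the relaxed program is available as a list `ys = [y_0, ..., y_n]` of NumPy vectors; the helper names and example numbers are illustrative only.

```python
import numpy as np

def slacks(layers, ys):
    """Slack vectors s_i = y_i - (W_i y_{i-1} + b_i) for i = 1..n.

    `ys` is [y_0, y_1, ..., y_n]. For a point that satisfies the relaxed
    constraints every entry is non-negative.
    """
    return [ys[i] - (W @ ys[i - 1] + b)
            for i, (W, b) in enumerate(layers, start=1)]

def sum_of_slacks(layers, ys):
    """Unweighted penalty: the sum of all slack entries."""
    return float(sum(s.sum() for s in slacks(layers, ys)))

# One layer with two neurons that are both active at x = 0.5.
layers = [(np.array([[1.0], [2.0]]), np.array([0.0, 0.0]))]
y0 = np.array([0.5])
exact = [y0, np.array([0.5, 1.0])]     # matches the linear response
relaxed = [y0, np.array([0.5, 1.75])]  # second neuron is relaxed upward
penalty_exact = sum_of_slacks(layers, exact)
penalty_relaxed = sum_of_slacks(layers, relaxed)
```

Because the entries are non-negative on feasible points, the plain sum coincides with the $L_1$-norm mentioned above.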


4.2  Max-Slack Conditioning Priority 


As noted above, the search component of PEREGRiNN operates layer-wise from 
input layer to output layer in order to leverage Symbolic Interval Analysis for 
additional inference. Hence, the search component always chooses the next neu- 
ron to be searched (i.e. conditioned) from among those as-yet-unconditioned 
neurons that are closest to the input layer. It further makes sense to only con- 
sider conditioning neurons that the convex solver was unable to operate at valid 
inputs/output. However, the convex solver typically returns several neurons to 
choose from with this property, and it is necessary to choose which of them to 
search next. Given the interpretation of a neuron’s “slack” variable as a measure 
of how “problematic” that neuron was for the solver to obtain a valid evaluation 
of the network, PEREGRiNN’s search component chooses the next neuron to 
condition based on slack-order ranking of those neurons that are not being oper- 
ated at valid input/output points. This “max-slack” heuristic choice is unique 
to PEREGRiNN; compare to the output gradient heuristic employed in [31]. 
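
A minimal sketch of this selection rule follows. It assumes slacks are available per layer as nested lists, uses 0-based `(layer, neuron)` indices, and treats a neuron as not operated at a valid point when its slack exceeds a small tolerance; these representation choices, the tolerance, and the behavior when the earliest layer has no positive slack are assumptions of the sketch.

```python
def pick_neuron(slack_by_layer, conditioned, tol=1e-9):
    """Pick the next neuron to condition.

    Scans layers from the input side and returns the (layer, neuron) pair
    with the largest slack in the first layer that still contains an
    unconditioned neuron with slack above `tol`. Returns None if no such
    neuron exists.
    """
    for i, layer in enumerate(slack_by_layer):
        candidates = [(s, j) for j, s in enumerate(layer)
                      if (i, j) not in conditioned and s > tol]
        if candidates:
            _, j = max(candidates)  # largest slack wins
            return (i, j)
    return None

slack_by_layer = [[0.0, 0.3, 0.1], [0.9, 0.0]]
first = pick_neuron(slack_by_layer, conditioned={(0, 1)})
second = pick_neuron(slack_by_layer, conditioned={(0, 1), (0, 2)})
nothing = pick_neuron([[0.0], [0.0]], conditioned=set())
```

In the first call the neuron with slack 0.9 in the deeper layer is not chosen, because an earlier layer still has an unconditioned neuron with positive slack.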


4.3 Layer-wise-Weighted Penalty 


PEREGRiNN takes the “max-slack” neuron search priority one step further,
though. Using techniques similar to those in [26], it is possible to show that 


292 H. Khedr et al. 


there exist weights $q_1, \dots, q_n$ such that solving (2) with the penalty

$$
\min \sum_{i=1}^{n} \sum_{j=1}^{k_i} q_i \, (s_i)_j \tag{5}
$$


will result in a solution that is guaranteed to concentrate the most total slack in
the earliest (unconditioned) layer. Thus, by using the layer-wise weighted
sum-of-slacks penalty in (5), PEREGRiNN is uniquely able to force the
(unconditioned) layer closest to the input layer to have the largest total slack
among all the layers. As a consequence, PEREGRiNN effectively concentrates the
most “problematic” neurons in the layer where the next conditioning choice will be
made. This scheme makes it much more likely that the neuron with the highest slack
among all of the neurons will be among the next neurons considered for
conditioning; in effect, it often guides the search component to condition on the
most problematic neuron in the whole network (although this is not guaranteed).

As noted above, SMC [26] can be used to obtain layer-wise weights that
guarantee concentration of slack in the earliest (shallowest) layer. However, these
weights are often very large, since they depend on bounding the slack variables
(most readily by over-approximation); the effect of this is possible computational
instability in the convex program. Thus, as an implementation matter, we instead
select these weights using a heuristic scheme characterized by two real-valued
hyperparameters, $\lambda_0$ and $\gamma$. In particular, the weight of the $i$th
layer, $q_i$, is selected as $q_i = \lambda_0 \cdot \gamma^i$. In our experiments,
we found the values $\lambda_0$ = 1077 and $\gamma$ = 10? to effectively achieve
the maximum slack concentration in the earliest layers.
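
The heuristic weighting can be sketched in a few lines. The hyperparameter values used below are purely illustrative and are not the values reported in the text; the function names and the nested-list representation of the slacks are likewise assumptions of this sketch.

```python
def layer_weights(n_layers, lam0, gamma):
    """Heuristic weights: lam0 * gamma**i for layers i = 1..n_layers."""
    return [lam0 * gamma ** i for i in range(1, n_layers + 1)]

def weighted_penalty(slack_by_layer, weights):
    """Layer-wise weighted sum of slacks: sum over layers of q_i * sum_j s_ij."""
    return sum(q * sum(layer) for q, layer in zip(weights, slack_by_layer))

# Illustrative hyperparameters only (not the paper's values).
weights = layer_weights(3, lam0=0.5, gamma=2.0)
penalty = weighted_penalty([[1.0, 0.0], [0.5], [0.25]], weights)
```

With a growth factor above one, slack in deeper layers costs more than the same amount of slack in shallower layers, which is consistent with the stated goal of pushing slack toward the input layer.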


4.4 Initial Counterexample Search by Sampling 


Finally, PEREGRiNN extends a simple idea first introduced in [32] to rapidly
identify counterexamples by means of sampling. The basic idea is to sample
within a known region of the input to the NN (or the input to some deeper layer),
and evaluate the NN (sub-NN) exactly on those samples in order to rapidly
identify a counterexample; this approach helps identify unsafe networks/properties
early on. However, whereas [32] samples from within hyper-rectangle sets derived
by symbolic interval analysis, PEREGRiNN uses the Volesti [12] Python library
to uniformly sample points within the polytopic input constraint set, $P_{y_0}$,
and thus applies to the more general input constraint sets in (1).
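
The sampling idea can be illustrated as follows. This sketch does not use Volesti: as a stated substitution, it draws uniform samples from a bounding box and rejects those outside the polytope `A_in @ x <= c_in`, which is only practical in low dimension. The function name, its parameters, and the `is_unsafe` predicate are hypothetical.

```python
import numpy as np

def sample_counterexample(layers, lo, hi, A_in, c_in, is_unsafe,
                          n_samples=1000, seed=0):
    """Search for an input in the polytope whose exact output is unsafe.

    Rejection sampling from the box [lo, hi] stands in for a dedicated
    polytope sampler. Returns a counterexample input, or None.
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, dtype=float), np.asarray(hi, dtype=float)
    for _ in range(n_samples):
        x = rng.uniform(lo, hi)
        if np.any(A_in @ x > c_in):
            continue  # sample lies outside the input polytope
        y = x
        for W, b in layers:  # exact evaluation of the network
            y = np.maximum(W @ y + b, 0.0)
        if is_unsafe(y):
            return x
    return None

# Network y = ReLU(x1 + x2) on the triangle x1, x2 >= 0, x1 + x2 <= 1.
layers = [(np.array([[1.0, 1.0]]), np.array([0.0]))]
A_in, c_in = np.array([[1.0, 1.0]]), np.array([1.0])
cex = sample_counterexample(layers, [0.0, 0.0], [1.0, 1.0], A_in, c_in,
                            lambda y: y[0] > 0.9)
none = sample_counterexample(layers, [0.0, 0.0], [1.0, 1.0], A_in, c_in,
                             lambda y: y[0] > 1.0)
```

A returned sample is a genuine counterexample because the network is evaluated exactly on it; a `None` result proves nothing and the search must continue.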


5 Experiments 


We evaluated the performance and effectiveness of PEREGRiNN at verifying 
the adversarial robustness of NNs trained to recognize digits using the standard 
MNIST dataset. This verification problem fits into the general NN verification 
problem described in Sect. 2, and it is described subsequently in detail. In this 
context, we evaluated PEREGRiNN with two objectives described as follows. 


Table 1. Architecture of the NN models used in the experiments

Models    | # ReLUs | Architecture
MNIST_FC1 | 512     | <784, 256, 256, 10>
MNIST_FC2 | 1024    | <784, 256, 256, 256, 256, 10>
MNIST_FC3 | 1536    | <784, 256, 256, 256, 256, 256, 256, 10>


1. We conducted ablation experiments for all of PEREGRiNN’s novel features
as described in Sect. 4. In particular, we compared the performance of a full
implementation of PEREGRiNN (i.e. exactly as described in Sect. 4) with
implementations that are otherwise the same except for changing one and
only one of the following: the penalty function used in the convex program
inference block; the neuron prioritization used by the search component.

2. We compared PEREGRiNN against other state-of-the-art NN verifiers, both
in terms of the time required to verify individual networks and properties and
in terms of the number of properties proved with a common, fixed timeout.


Implementation. We implemented PEREGRiNN in Python, and used an
off-the-shelf Gurobi 9.1 [1] convex optimizer for solving linear programs; the
Volesti [12] Python interface was used to sample from the input polytope for the
sampling inference block. For the other NN verifiers, we used publicly available
implementations that were published by their creators (citations are included
below). Each instance of any verifier was run within its own single-core
VirtualBox VM with 30 GB of memory; no more than 4 VMs were run concurrently
on a host machine with 48 hyperthreaded cores and 256 GB of memory.


5.1 Adversarial Robustness Verification Task 


Subsequent experiments used the testbench we describe in this section; it is 
largely identical to the PAT-FCN test in the VNN-COMP 2020 competition [2]. 


Neural Networks. We used three ReLU NNs to recognize digits using the
standard MNIST training database; these NNs are exactly as in the PAT-FCN
portion of [2]. The sizes of these fully-connected networks are described in Table 1.
Each entry in the “Architecture” column of Table 1 is the number of
neurons in a layer, from input layer on the left to output layer on the right.


Verification Properties. We created a number of NN verification tasks based
on proving whether the above described networks were robust against max-norm
perturbations of their inputs. In particular, each verification task involves
proving whether a particular input image, $x'$, always results in the same
classification when it is subjected to a max-norm perturbation of at most some
fixed size, $\epsilon > 0$. Thus, each such verification problem is parameterized
by both the specified input image, $x'$, and the maximum amount of perturbation,
$\epsilon$.

Formally, let $x'$ be a given image in category $t \in \{1, \dots, M\}$, and let
$\epsilon > 0$ be a specified maximum amount of max-norm perturbation of $x'$.
Then we say that a NN with $M$ classification outputs, $\mathcal{NN}$, is robust
if for each classification category $m \in \{1, \dots, M\} \setminus \{t\}$ the
set of inputs yielding classification of $x'$ as $m$


is empty. Note that each instance of (6) is compatible with the problem in (1). 
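
One way to write such a set, assuming the usual rule that the predicted class is the index of the largest output, is sketched below. The name $\phi_m$ is hypothetical, and this formulation is an assumption consistent with the surrounding prose rather than a verbatim restatement of the displayed equation.

```latex
\phi_m \triangleq \Big\{ x \in \mathbb{R}^{k_0} \;\Big|\;
    \| x - x' \|_\infty \le \epsilon
    \;\wedge\;
    \bigwedge_{i \ne m} \big(\mathcal{NN}(x)\big)_i \le \big(\mathcal{NN}(x)\big)_m
\Big\}
```

Under this reading, the max-norm ball around $x'$ is a box and hence a convex polytope in the input space, and each comparison between two outputs is a linear constraint on $\mathcal{NN}(x)$, so the output constraints form a convex polytope as well. Robustness then amounts to $\phi_m = \emptyset$ for every $m \ne t$.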


Adversarial Robustness Verifier Testbench. Our verification testbench
was then constructed by selecting 50 test images from the MNIST test dataset;
this set of test images includes the 25 used in the PAT-FCN portion of [2]. Each
test instance was then a combination of one of those images, one of the networks
from Table 1 and one of the following two max-norm perturbations,
$\epsilon = 0.02$ or $\epsilon = 0.05$; these perturbations are the same ones used
in PAT-FCN [2]. Thus, each verification test in our testbench can be identified by
one of 300 tuples of the form: (net, image, perturb.) $\in \mathcal{TB}$ =
{FC1, FC2, FC3} × {1, ..., 50} × {0.02, 0.05}.
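
The size of the testbench follows from the Cartesian product of its three factors, as this short sketch shows. The variable names are illustrative, and images are identified only by an index from 1 to 50.

```python
from itertools import product

NETS = ("FC1", "FC2", "FC3")
IMAGES = range(1, 51)         # indices of the 50 selected test images
PERTURBATIONS = (0.02, 0.05)  # max-norm perturbation sizes

# Every (net, image, perturbation) combination is one verification test.
testbench = list(product(NETS, IMAGES, PERTURBATIONS))
num_tests = len(testbench)    # 3 * 50 * 2
```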


5.2 Ablation Experiments 


In this series of experiments we evaluated the contribution that each of the
primary PEREGRiNN enhancements made to its overall performance. This was
done by comparing the full PEREGRiNN algorithm, as described in Sect. 4,
with altered versions that replace exactly one of those enhancements at a time.
Note: removing core features of PEREGRiNN often resulted in much longer
run times, so the experiments in this section use a testbench
$\mathcal{TB}' \subset \mathcal{TB}$ that excludes all tests with one of the
larger networks FC2 or FC3 and $\epsilon = 0.05$.


Penalty Function Ablation. Our first ablation experiment evaluated the con- 
tribution of PEREGRiNN’s unique penalty function features; see Sect. 4.1 and 
Sect. 4.3. In particular, we ran different variants of PEREGRiNN with the fol- 
lowing penalty functions used inside the convex program inference block: 


1. “Weighted sum of slacks”: PEREGRiNN’s own weighted sum of slacks 
penalty; 

2. “Sum of slacks”: A sum-of-slacks penalty with equal weighting on all layers; 

3. “Feasibility”: A feasibility-only convex program such as the one used in other 
tools, e.g. [31] (i.e. simply using a constant penalty function of 1); 

4. “Inverted weighted sum of slacks”: PEREGRiNN’s own weighted sum of slacks 
penalty, except with the layer-wise weights applied in reverse order to force 
slack towards deeper layers rather than shallower ones (see also Sect. 4.3). 


Figure 2a shows a cactus plot of the number of proved cases vs. the timeout
permitted to the algorithm: i.e. to prove at least a specified number of the test
cases, each algorithm must have its timeout set to the value of its curve in
Fig. 2a. Figure 2b shows a histogram of the number of times each of the algorithm
variants needed to call the convex solver in order to terminate; this quantifies
each algorithm’s cost in a well-known unit of computation, which is also the single
most computationally costly part of PEREGRiNN. Figure 2b plots the number of
test cases falling into evenly spaced bins of convex solver calls.
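
For readers unfamiliar with cactus plots, the following sketch shows how the points of such a curve can be derived from per-test run times. The function name and the convention of using `None` for unsolved tests are assumptions of this sketch.

```python
def cactus_curve(runtimes, timeout):
    """Points (k, t): proving at least k cases needs a timeout of at least t.

    `runtimes` holds one solve time in seconds per test case, or None if
    the test was never solved. Tests slower than `timeout` are dropped.
    """
    solved = sorted(t for t in runtimes if t is not None and t <= timeout)
    return list(enumerate(solved, start=1))

curve = cactus_curve([0.4, None, 2.5, 0.1, 700.0], timeout=600.0)
```

Sorting the solved run times in increasing order is what gives the curve its monotone shape; an algorithm whose curve lies lower proves the same number of cases with a shorter timeout.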

Fig. 2. Performance of PEREGRiNN variants with different objective functions


Conclusions: Figure 2a demonstrates that PEREGRiNN’s weighted sum of slacks 
has a clear benefit over both a uniformly weighted sum-of-slacks penalty and a 
plain feasibility convex program. For timeouts longer than ~1.2 seconds,
PEREGRiNN overtakes the other two in terms of number of properties proved;
even the uniform sum-of-slacks penalty considerably outperforms the feasibility
convex program at similar timeouts. Note that reversing the layer-wise weights of
PEREGRiNN’s penalty function incurs a performance hit, especially for timeouts
>1.2 s. This suggests that driving slacks toward shallower layers, where the next
neuron is conditioned, is the correct heuristic to apply. Figure 2b also shows that 
going from feasibility to sum-of-slacks to weighted sum-of-slacks significantly 
reduces the number of test cases that require between 425 and 525 calls to the 
convex solver. This order of comparison shows a concomitant net influx of tests 
into the lowest bin of < 25 convex calls; PEREGRiNN has the most test cases 
in this category, with +130 test cases proved in < 25 convex solver calls. 


Neuron Conditioning Priority Ablation. In the second ablation experiment,
we evaluated the contribution of PEREGRiNN’s maximum-slack neuron
conditioning priority (see Sect. 4.2). To that end, we ran variants of
PEREGRiNN with three different neuron conditioning priorities for the search
component:


1. “Maximum slack”: PEREGRiNN’s max-slack neuron conditioning priority; 
2. “Minimum slack”: This variant conditions the neuron with the smallest slack; 
3. “Random choice”: This variant conditions on a random indeterminate neuron. 


The performance of these algorithm variants is shown in Fig. 3a and Fig. 3b. 
As in the previous ablation experiment, Fig. 3a shows a cactus plot of the number 
of proved cases vs. the timeout, and Fig. 3b shows a histogram of the number of
calls to the convex solver required under each of the conditioning priorities. 


Conclusions: Figure 3a shows that PEREGRiNN’s max-slack neuron priority 
allows it to prove slightly more properties than either a random neuron choice 
priority or the minimum-slack priority. The maximum slack priority also required 
the fewest total convex calls across all instances: it used 178 fewer than minimum 
slack and 686 fewer than a random choice. Thus, we conclude PEREGRiNN’s 
max-slack heuristic slightly improves performance on this testbench. 


5.3 Comparison with Other NN Verifiers 


In this experiment, we evaluated PEREGRiNN with respect to a number of 
state-of-the-art NN verifiers on our adversarial robustness testbench, JZZ. In 
particular, we ran the following tools on YY: Venus [6]; Marabou [19]; Neu- 
rify [31]; and nnenum [4]. Venus was run with st_ratio=0.4, depth_power=4, 
offline_deps = True, online_deps = True, and ideal_cuts = True; Marabou 
and Neurify were used with default parameters but THREADS = 1; and nnenum 
had ADVERSARIAL_SEARCH turned off. Each algorithm had its own one-core VM. 
Fig. 3. Performance of PEREGRiNN variants with different conditioning priorities


Figure 4 contains a cactus plot showing the results for each of these algo-
rithms, including PEREGRiNN. For a given number of test cases to be proved,
Fig. 4 depicts the corresponding timeout required for each of the algorithms to
prove that many cases. Of all the algorithms, PEREGRiNN was able to prove
the most properties within the timeout limit of 600 s: PEREGRiNN was able
to prove 190 properties; it was followed by nnenum, which proved 172; Venus, 
which proved 159; Neurify, which proved 149; and Marabou, which proved 125. 
Marabou consistently performed the worst, proving fewer cases than any other 
algorithm at every timeout. By contrast, Neurify was able to prove significantly 
more test cases than any other algorithm for extremely short timeouts, but it 
failed to prove more than 150 out of 300 test cases across the whole experiment. 
nnenum performed worse than Neurify on the way to proving 150 test cases, but 
it fared significantly better than either PEREGRiNN or Venus, which had more 
or less similar performance below this threshold. However, after ~150 test cases,
PEREGRiNN significantly outperformed all other algorithms: as the timeout
was increased, PEREGRiNN proved additional properties at a rate significantly
outpacing its closest competitor in this regime, nnenum. We further note that
all algorithms proved a mixture of SAT and UNSAT properties. 
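
For readers unfamiliar with cactus plots, the points of such a plot can be derived from per-case solve times as in the sketch below. It assumes that each test case is run independently with its own timeout and that an unsolved case is recorded as None; the function names are illustrative only.

```python
def cactus_points(solve_times, timeout):
    """Return (k, t_k) pairs, where t_k is the smallest timeout
    under which k test cases are proved."""
    solved = sorted(t for t in solve_times if t is not None and t <= timeout)
    return [(k, t) for k, t in enumerate(solved, start=1)]


def proved_within(solve_times, timeout):
    """Number of test cases proved within the given timeout."""
    return sum(1 for t in solve_times if t is not None and t <= timeout)


# Hypothetical solve times in seconds; None marks a case that was never solved.
times = [0.5, None, 3.0, 700.0, 1.2]
print(cactus_points(times, timeout=600.0))
print(proved_within(times, timeout=600.0))
```

A solver whose curve starts low but ends early is fast on easy cases and proves few cases overall, which is the pattern described for Neurify above.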

This data, taken as a whole, suggests that PEREGRiNN suffers from a worse 
“best-case” performance than several other algorithms, especially nnenum and 
Neurify. However, PEREGRiNN’s performance seems to be much more consis- 
tent across different test cases. This allows it to prove more properties in aggre- 
gate at the expense of being slower on a smaller subset of them. This further 
suggests that PEREGRiNN is significantly less sensitive to peculiarities of partic-
ular test cases on the YY testbench. This will likely be a considerable advantage, 
on average, when faced with verifying unknown networks and properties of this 
type. 


6 Discussion: Analogy to SAT Solvers 


It is possible to draw a loose analogy between SAT solvers and search-and- 
optimization NN verifiers such as PEREGRiNN. Indeed, since each neuron has 
two phases, the operational phase of each neuron can be captured by a binary 
variable; then any valuation of all these variables can be interpreted as SAT or 
UNSAT based on the Input/Output properties to be verified on the network 
(subject to that conditioning). Thus, the neuron conditioning step in PEREGRiNN
is analogous to variable splitting in a SAT solver, and the backtrack and
re-condition block (see Fig. 1) functions analogously to backtracking. In this
analogy, infeasibility of the convex program and symbolic interval analysis func- 
tion roughly like unit resolution in a SAT solver: they soundly reason about the 
overall property before all neurons have been conditioned (i.e. variables split). 


[Figure 4: cactus plot. x-axis: proved cases (0 to 175); y-axis: timeout (logarithmic scale); series: PEREGRiNN, Venus, nnenum, Marabou, Neurify.]

Fig. 4. Cactus plot of various solvers on 300-case testbench, 7A


However, the main contribution of PEREGRiNN is a heuristic for deciding 
which neuron to condition next: it is thus analogous to a heuristic for choosing 
the next variable to split in a SAT solver. Specifically, PEREGRiNN’s heuristic 
provides a numerical ranking of the as-yet-unconditioned neurons, and therefore 
has a functional similarity to variable-ranking heuristics in SAT solvers (e.g. 
VSIDS [24]). On the other hand, PEREGRiNN’s neuron ranking comes directly 
from the output of the convex solver, which we argued reveals some information
about the underlying verification problem — this has no direct SAT-solver analog. 
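
The analogy can be made concrete with a toy search loop. The sketch below is not PEREGRiNN's algorithm: it is a generic depth-first search over binary neuron phases in which the ranking heuristic and the pruning check are supplied as functions, mirroring the roles described above. All names are hypothetical.

```python
def search(neurons, assignment, prune, rank, is_witness):
    """Depth-first search over neuron phases with backtracking.

    prune(assignment)       -> True if the partial assignment can be soundly
                               discarded (the role played by infeasibility of
                               the convex program or interval analysis).
    rank(unassigned, asg)   -> the neuron to condition next (the heuristic).
    is_witness(assignment)  -> True if a complete assignment is a witness.
    Returns a complete assignment, or None if none exists.
    """
    if prune(assignment):
        return None
    unassigned = [n for n in neurons if n not in assignment]
    if not unassigned:
        return dict(assignment) if is_witness(assignment) else None
    neuron = rank(unassigned, assignment)
    for phase in (True, False):          # split on the chosen neuron
        assignment[neuron] = phase
        result = search(neurons, assignment, prune, rank, is_witness)
        if result is not None:
            return result
        del assignment[neuron]           # backtrack and re-condition
    return None


# Toy instance: a witness needs neuron "a" active and neuron "c" inactive.
result = search(
    neurons=["a", "b", "c"],
    assignment={},
    prune=lambda asg: asg.get("a") is False,
    rank=lambda unassigned, asg: unassigned[0],
    is_witness=lambda asg: asg["a"] and not asg["c"],
)
print(result)
```

Only the rank argument changes when a different conditioning priority is plugged in; the search skeleton stays the same.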


7 Conclusion 


In this paper, we introduced PEREGRiNN, a new tool for formally verifying
input/output properties for ReLU NNs. PEREGRiNN compares favorably with 
other state-of-the-art NN verifiers, thanks to a number of unique algorithmic fea- 
tures. The benefits of these features were established with ablation experiments. 
Abstract. Architecture specifications such as Armv8-A and RISC-V
are the ultimate foundation for software verification and the correct- 
ness criteria for hardware verification. They should define the allowed 
sequential and relaxed-memory concurrency behaviour of programs, but 
hitherto there has been no integration of full-scale instruction-set archi- 
tecture (ISA) semantics with axiomatic concurrency models, either in 
mathematics or in tools. These ISA semantics can be surprisingly large 
and intricate, e.g. 100k+ lines for Armv8-A. 

In this paper we present a tool, Isla, for computing the allowed 
behaviours of concurrent litmus tests with respect to full-scale ISA def- 
initions, in Sail, and arbitrary axiomatic relaxed-memory concurrency 
models, in the Cat language. It is based on a generic symbolic engine 
for Sail ISA specifications, which should be valuable also for other veri- 
fication tasks. We equip the tool with a web interface to make it widely 
accessible, and illustrate and evaluate it for Armv8-A and RISC-V. 

By using full-scale and authoritative ISA semantics, this lets one eval- 
uate litmus tests using arbitrary user instructions with high confidence. 
Moreover, because these ISA specifications give detailed and validated 
definitions of the sequential aspects of systems functionality, as used by 
hypervisors and operating systems, e.g. instruction fetch, exceptions, and 
address translation, our tool provides a basis for developing concurrency 
semantics for these. We demonstrate this for the Armv8-A instruction- 
fetch model and self-modifying code examples of Simner et al. 


1 Introduction 


A processor architecture should define, for any initial machine state, the set 
of all architecturally allowed observable executions—thus specifying the basic 
assumptions for programming and for software verification, and the correct- 
ness criterion for hardware verification. Architecture specifications have two 
main parts: the sequential and relaxed-memory concurrent aspects of instruc- 
tion behaviour, each of which has been studied in previous work. For Armv8-
A and RISC-V, Armstrong et al. have established full-scale sequential mod- 
els in Sail [10,15], a domain-specific language for instruction-set architecture 
(ISA) specifications, that are complete enough to boot real-world operating sys-
tems such as Linux. For Armv8-A this model is automatically derived from 
the authoritative Arm-internal specification [24], while for RISC-V it has been 
hand-written and adopted by RISC-V International. On the concurrency side, 
relaxed-memory semantics can be specified in two main styles: either as abstract- 
microarchitectural operational models, characterising observable behaviour with 
explicit out-of-order execution and buffering, or as axiomatic models, expressed 
as a predicate over complete candidate executions represented as graphs of mem- 
ory events. For Armv8-A and RISC-V “user” concurrency, both exist [1,7,8,22],
along with a “Promising ARM” variant [23]. For Armv8-A they have been proved 
equivalent [21,22]; the authoritative vendor definition is the axiomatic one. 

However, while an architecture should define the set of allowed executions for 
arbitrary programs, hitherto there has been no integration of full-scale ISA defi- 
nitions with axiomatic concurrency models, either in mathematics or in tools (for 
operational models, this has only been done for RISC-V; other operational mod- 
els have used small ISA fragments). Research and industry practice for relaxed 
memory semantics rely on making the semantics executable as a test oracle: 
not just a paper definition (in prose or mathematics), but tool-supported def- 
initions that for small litmus-test examples can compute the set of all allowed 
executions, that can then be compared against experimental data. Many tools 
have been developed for operational and axiomatic architectural concurrency 
models [4,6,8,12,14,17-20,25,26,28-32], with axiomatic tools notably includ-
ing the Herd tool of Alglave and Maranget [4,6,8], that can evaluate litmus 
tests w.r.t. axiomatic memory models specified in a relational-algebra style in 
the Cat language [2]. However, all of these previous tools for axiomatic models 
have (at best) used hard-coded ISA semantics that cover only small fragments 
of the complete architecture. For example, Zhang et al. [32] use an SMT-solver-
based approach for SoC verification, with a user-specified memory model (TSO
or SC); however, the instruction-level abstractions (ILAs) they use are much more
abstract than the ISA semantics we consider.

In this paper we describe a tool, Isla, that integrates full-scale ISA spec- 
ifications, in Sail, with arbitrary axiomatic models, in the Cat language. We 
first build a generic symbolic execution library for Sail specifications—which 
should also be valuable for other verification tasks. We use this to construct a 
tool for symbolically running binary litmus tests for any Sail ISA under any 
(non-recursive) Cat axiomatic memory model, using an SMT solver. We equip 
it with a web interface to make it widely accessible, and illustrate and evaluate 
all this for Armv8-A and RISC-V. Isla is available at https://isla-axiomatic.cl.cam.ac.uk
and https://github.com/rems-project/isla. An extended version of the
paper [11], available at https://www.cl.cam.ac.uk/~pes20/isla/, includes appendices
showing the main parts of the full Sail/ASL semantics of a sample Armv8-A
instruction (add x4, x3, #1); the Armv8-A axiomatic concurrency model (com- 
bining the official Arm specification for user concurrency [9,13] with the addi- 
tions for instruction fetch semantics by Simner et al. [27]); and examples of the 
latter. 
Our approach has several key advantages, which all follow from the fact that
mainstream industry ISAs are surprisingly large and intricate. The Armv8-A ISA 
specification is around 100k lines. It defines the sequential behaviour of the full 
instruction set in all its detail, including e.g. instruction decoding, behaviour at 
each exception level, register banking, floating-point, vector instructions, system 
registers, exceptions, address translation, virtualisation, security extensions, and 
a host of optional architectural features. Simple litmus tests developed to investi- 
gate user concurrency have historically used only very few instructions and very 
little of this, and hand-written ISA models have sufficed, but even a ‘simple’ 
ADD instruction can, in reality, involve surprisingly much of the specification. If 
one wants to examine arbitrary compiler-generated code one needs many more 
instructions; and to develop systems concurrency semantics, e.g. covering the 
concurrency behaviour of instruction fetch, exceptions, or address translation, 
one might need any of the specification—and it would be exceedingly laborious 
and error-prone to reproduce it by hand in a hard-coded semantics. By handling 
the full authoritative Armv8-A ISA, we automatically support litmus tests that 
use arbitrary instructions, and we enable research on systems concurrency, with 
high confidence that the ISA follows the vendor specification. We demonstrate 
this by applying our tool to the model and examples for self-modifying code by 
Simner et al. [27], and our integration has also identified several places where the 
ISA specification needs modifications to correctly give the intended behaviour in 
a concurrent setting, e.g. to remove or enforce additional ordering. Because this 
is based on authoritative Arm and RISC-V ISA specifications, the work should 
enable relaxed-memory behaviour to be included in the standard test-edit-debug 
cycle used in the development of such large and critical specifications. 


2 Implementation 


Axiomatic relaxed-memory concurrency models, being expressed as logical con- 
straints over candidate execution graphs, lend themselves to solver-based tool 
implementations. For the instruction-semantics part of such a tool, the most 
direct approach would be to translate the ISA semantics (for the instructions that 
occur in a litmus test) directly into SMT and combine that with the axiomatic- 
model constraints, roughly along the lines of Alglave et al. [3]. That approach 
was followed by Simner et al. [27], who compiled Sail directly into SMT to test 
an axiomatic model for instruction-fetch tests, but using a small handwritten 
Arm fragment, rather than the full Sail model derived from the Arm-internal 
model. The problem with this direct approach is one of scale: as one covers more 
of the Arm semantics, the resulting SMT problem simply becomes too large to 
be practicable. For example, for a load instruction, the virtual address must be 
translated into a physical address, which is a complex process with a great deal 
of configurability—there may be zero, one, or two stages of address translation, 
the page size may vary, the number of levels used in the page table may differ, 
etc. This approach also required the top-level fetch-decode-execute loop to be
handled specially, as one cannot translate such an unbounded loop directly into 
SMT, which imposes significant constraints on the shape of allowable tests. 
In contrast, here we build and use a generic symbolic evaluation for Sail def-
initions using the Z3 SMT solver, which lets us compute the possible symbolic 
thread-local traces of each instruction, and hence of each thread (treating mem- 
ory read values as unknowns, left to the concurrency model constraints). It also 
lets us use the same fetch-decode-execute loop that is used for emulation and 
co-simulation (which embodies various architecture-specific subtleties). 


2.1 Symbolic Execution for Sail 


Sail is attractive for symbolic execution for several reasons. First, it is an inten- 
tionally simple language, lacking many of the features found in general-purpose 
languages. Second, it has to support very few programs, just the specifications 
of major ISAs, so (unlike tools for conventional programming languages) we can 
tune the execution to them. Third, almost all of the loops in these programs 
are bounded. Our starting point is the translation of Sail to C, for emulation, 
by Armstrong et al. [10]. This goes via a simple goto-language intermediate 
representation which is already well-suited for this task. 


Static Function Linearisation. Our symbolic execution always creates a new 
task when we hit a branch, and we do not ever merge these tasks at join points. 
This is a good strategy for instruction semantics, as it simplifies the symbolic 
execution engine significantly, but it does mean some code can cause unnecessary 
branching. To avoid this we have a static rewrite that can take a function with 
if statements and rewrite it into a ‘linear’ form, e.g. as below: 


    var x = 2;
    let b = undefined;           let x0 = 2;
    if b {                  =>   let b = undefined;
      x = x + 1                  let x1 = x0 + 1;
    } else {                     let x2 = x0 + 2;
      x = x + 2                  let x3 = ite(b, x1, x2);
    };                           return x3
    return x


This works by translating the body of the function into SSA form, then 
replacing the φ-functions with if-then-else (ite) functions that translate into the
SMT ite. This results in a more complex SMT expression, but less branching in 
the symbolic execution, so it is a trade-off, but often worthwhile. 
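
The effect of the rewrite can be checked on the example above with a few lines of Python. This is only an illustration of the idea, not the Sail rewrite itself: the branching function is the source form implied by the linear code, the helper ite stands in for the SMT ite, and the names branching and linearised are hypothetical.

```python
def branching(b):
    """Source form: one branch, so a symbolic executor would fork here."""
    x = 2
    if b:
        x = x + 1
    else:
        x = x + 2
    return x


def ite(cond, then_val, else_val):
    """Stand-in for the SMT if-then-else term."""
    return then_val if cond else else_val


def linearised(b):
    """SSA form with the phi-function replaced by ite: no control-flow branch."""
    x0 = 2
    x1 = x0 + 1          # value of x on the 'then' path
    x2 = x0 + 2          # value of x on the 'else' path
    x3 = ite(b, x1, x2)  # the join point selects between them
    return x3


for b in (True, False):
    print(b, branching(b), linearised(b))
```

Both arms are evaluated unconditionally in the linear form, which is why the rewrite trades a larger expression for fewer symbolic-execution tasks.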


Per-Thread Candidate Executions. For each litmus-test thread this sym- 
bolic execution will produce a number of candidate executions, each of which 
is a sequence of memory events (memory reads and writes, fences, register 
accesses, and so on) with the symbolic values of these events potentially being 
constrained by some SMT formula for the overall execution. For example, con- 
sider the Armv8-A instruction add x4, x3, #1. For this instruction, our symbolic 
evaluator generates an execution: 

    (declare-const input (_ BitVec 64))
    (read-reg |R3| nil input)
    (define-const output (bvadd input #x0000000000000001))
    (write-reg |R4| nil output)


where the SMTLIB formula is defined by the declare-const and define-const 
statements, with read-reg and write-reg effects indicating which variables in the 
SMT formula correspond to the values read and written to registers (which are 
otherwise just global variables) by the instruction. We simplify here for brevity, 
omitting the negative, zero, carry and overflow flags that the model computes. 
For more complex instructions, there are additional effects for memory accesses, 
cache maintenance events, barriers, and so on. 
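
To make the reading of such a trace concrete, the sketch below replays the simplified add x4, x3, #1 trace on a concrete register file in plain Python. It is an illustration under stated assumptions, not part of Isla: the tuple encoding of effects and the name run_add_trace are hypothetical, and the flags are omitted as in the text.

```python
MASK64 = (1 << 64) - 1  # 64-bit bitvector arithmetic wraps modulo 2**64


def run_add_trace(regs):
    """Concretely replay the trace for `add x4, x3, #1`.

    regs: dict mapping register name -> integer value.
    Returns the updated register file and the list of effects performed.
    """
    effects = []
    value_in = regs["R3"] & MASK64           # (read-reg |R3| nil input)
    effects.append(("read-reg", "R3", value_in))
    value_out = (value_in + 1) & MASK64      # (bvadd input #x0000000000000001)
    new_regs = dict(regs, R4=value_out)      # (write-reg |R4| nil output)
    effects.append(("write-reg", "R4", value_out))
    return new_regs, effects


regs, effects = run_add_trace({"R3": 41, "R4": 0})
print(regs["R4"])
print(effects)
```

In the symbolic setting the input is left as an unconstrained constant rather than a concrete number, so one trace covers every possible value of R3.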


2.2 Checking a Litmus Test 


Figure 1 shows the overall process of checking a litmus test. Tests can be supplied
either in the .litmus format of previous axiomatic and operational tools [4,5,14],
reusing the parser from [4], or as a TOML file (a standard configuration
file format, with libraries available for most languages). We first assemble the 
test with a conventional assembler into an ELF binary and load it into the 
representation of memory that will be used, before initialising the model with 
the program counter set to the entry point for each thread, then we symbolically 
execute the instructions in each thread separately, using the Sail semantics for 
each instruction, plus the same fetch-decode-execute loop in Sail we would use
for emulation, to produce sets of per-thread traces as above. Treating litmus tests 
essentially as binaries, rather than the more-or-less ad hoc fragments of assembly 
abstract syntax used by earlier tools, accommodates the fact that the Armv8-A 
model does not define an abstract syntax, and reduces the gap between what 
the tool evaluates and what is run in experimental testing. Note that the Arm 
assembly in Fig. 1, as well as subsequent assembly snippets in this paper, use 
the standard Arm convention that x0 and w0 refer to the same register, where w0 
refers to the lower 32 bits of the register, and x0 refers to the full 64-bit width.

We then generate an SMT problem for every combination of the candidate 
executions of each thread. This problem consists of the per-thread SMT formulae 
concatenated together (renaming variables as necessary to avoid name-clashes), 
combined with the axiomatic memory model (described in more detail below). 
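
The combination step amounts to taking the Cartesian product of the per-thread candidate sets and renaming variables per thread. A minimal sketch, assuming each candidate is represented simply by the list of SMT variable names it declares (a hypothetical simplification; the real candidates are full traces):

```python
from itertools import product


def combine_candidates(per_thread):
    """Build one combined problem per combination of per-thread candidates.

    per_thread: list with one entry per thread; each entry is a list of
                candidates, and each candidate is a list of variable names.
    Variables are prefixed with the thread id to avoid name clashes.
    """
    problems = []
    for combo in product(*per_thread):
        problem = []
        for tid, candidate in enumerate(combo):
            problem.append([f"t{tid}_{var}" for var in candidate])
        problems.append(problem)
    return problems


per_thread = [
    [["v0", "v1"]],           # thread 0: one candidate execution
    [["v0"], ["v0", "v1"]],   # thread 1: two candidate executions
]
for problem in combine_candidates(per_thread):
    print(problem)
```

The number of problems is the product of the per-thread candidate counts, which is one reason to keep per-instruction branching low.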

Finally, we need to generate some ‘glue’ SMT that connects the per-thread 
semantics with the memory model. For every effect in the per-thread SMT 
semantics we generate an enumeration of events, e.g. for an execution with two 
reads and two writes: 


(declare-datatypes ((Event 0)) (((R1) (R2) (W1) (W2) (IW)))) 


The event IW is a special write event that represents the initial state. We generate 
relations such as value-of that relate events to their values as determined by the 
effects in the per-thread semantics, so if the second read event R2 read the value 
#xABCD, (value-of R2 #xABCD) would be true. We generate syntactic dependency 

[Figure 1: pipeline diagram. The litmus test MP.litmus (thread #0, initial state x3 = y, x1 = x: mov w0, #1; str w0, [x1]; mov w2, #1; str w2, [x3]; thread #1, initial state x3 = x, x1 = y: ldr w0, [x1]; ldr w2, [x3]; final state assertion 1:x0 = 1 & 1:x2 = 0) is assembled, candidates are generated for each thread, an SMT problem is built and checked with (check-sat), and, if satisfiable, the model is parsed to generate an execution graph.]

Fig. 1. Overview of process for checking the allowed executions of a litmus test


relations for address, data, and control dependencies, discussed in more detail 
in Sect. 2.3. Finally, there is a constraint on the final state of each test which 
specifies values expected in registers and memory after all threads have executed. 

The Cat language represents axiomatic memory models as definitions of rela- 
tions over the above events, and constraints over those relations, e.g. that spe- 
cific relations are irreflexive, acyclic, or empty (or the negation of any of these). 
Relations are defined in a point-free relation-algebraic style, in terms of standard 
relational operators such as composition, intersection and union. The memory 
models we consider are all multi-copy-atomic, and all recursion in their defini- 
tions can trivially be replaced with (reflexive)-transitive closure. Herd’s let rec 
construct computes the least solution to a set of equations [2], which is tricky 
to represent in SMT, so we do not support it. We believe even relations such 
as Power’s (mutually recursive) preserved program order are nevertheless repre- 
sentable as SMT, so this limitation is mostly in our translation from Cat—we 
would likely want to use a different syntax to represent these relations for Isla. 
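
The relational style can be illustrated outside SMT with relations as sets of event pairs. The sketch below is a plain-Python illustration, not Isla's translation: it checks the MP execution of Fig. 1 against the sequential-consistency axiom acyclic(po | rf | fr), which is used here only because it is short and is not the Armv8-A model. The event names a to d are hypothetical.

```python
def compose(r, s):
    """Relational composition r; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def transitive_closure(r):
    """Least transitive relation containing r."""
    closure = set(r)
    while True:
        extra = compose(closure, r) - closure
        if not extra:
            return closure
        closure |= extra


def irreflexive(r):
    return all(a != b for (a, b) in r)


def acyclic(r):
    return irreflexive(transitive_closure(r))


# MP execution: thread 0 writes x (a) then y (b);
# thread 1 reads y = 1 (c) then x = 0 (d).
po = {("a", "b"), ("c", "d")}   # program order
rf = {("b", "c")}               # c reads from b
fr = {("d", "a")}               # d reads the initial x, later overwritten by a

print(acyclic(po | rf | fr))    # sequential consistency forbids this execution
print(acyclic(rf | fr))         # no cycle once program order is dropped
```

Because every recursive definition in the models considered reduces to a transitive closure, a closure operator of this kind is all the recursion the translation needs.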

A satisfiable solution to the overall SMT problem described above thus rep- 
resents an execution permitted by the architecture. Parsing the model generated 
by the SMT solver allows us to generate a graph of the execution by instantiating 
each relation in the model with the various events. If all generated SMT problems 
are unsatisfiable for every combination of per-thread candidate executions then
there are no permitted executions. If desired we can repeatedly ask the SMT 
solver for additional distinct models until we have all permitted executions. 


2.3 Syntactic Dependency Analysis 


Axiomatic memory models for relaxed hardware architectures rely heavily on 
notions of address, data, and control dependencies between instructions. For 
example, consider the following assembly: 


    ldr w0, [x1]    // load 32 bits from address in x1 into x0
    cbnz w0, LCO1   // compare and branch if non-zero to LCO1
LCO1:
    mov w2, #1      // load 1 into x2
    str w2, [x3]    // store 32-bit value in x2 to the address in x3


Here there is a control dependency between the load (ldr) and the store (str), as 
the value read by the load is used to determine whether the branch instruction 
cbnz that precedes the store is taken or not. This control dependency exists
regardless of whether the branch is taken or not; its existence is purely determined
by the syntactic structure of the above code. 
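
A much-simplified version of this syntactic rule can be written down directly. The sketch below assumes a hypothetical instruction encoding (a kind plus the registers read and written) and only computes control dependencies from loads to program-order-later stores; it is not the footprint analysis used by Isla.

```python
def control_dependencies(program):
    """Return the set of (load_index, store_index) control dependencies.

    program: list of dicts with keys 'kind' ("load", "store", "branch",
             "other"), 'reads' and 'writes' (lists of register names).
    """
    taint = {}             # register -> indices of loads whose value reaches it
    ctrl_sources = set()   # loads that feed some earlier branch
    deps = set()
    for i, ins in enumerate(program):
        sources = set()
        for reg in ins["reads"]:
            sources |= taint.get(reg, set())
        if ins["kind"] == "branch":
            ctrl_sources |= sources
        if ins["kind"] == "store":
            # Every store after a tainted branch depends on the load,
            # whether or not the branch is taken.
            deps |= {(load, i) for load in ctrl_sources}
        result_taint = {i} if ins["kind"] == "load" else sources
        for reg in ins["writes"]:
            taint[reg] = set(result_taint)
    return deps


# The assembly example above, using x-register names throughout.
program = [
    {"kind": "load",   "reads": ["x1"],       "writes": ["x0"]},  # ldr w0, [x1]
    {"kind": "branch", "reads": ["x0"],       "writes": []},      # cbnz w0, ...
    {"kind": "other",  "reads": [],           "writes": ["x2"]},  # mov w2, #1
    {"kind": "store",  "reads": ["x2", "x3"], "writes": []},      # str w2, [x3]
]
print(control_dependencies(program))
```

The branch target never enters the computation, which reflects the point that the dependency is determined by the code's structure rather than by the path taken.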

In general, existing ISA descriptions do not cover this aspect of the archi- 
tecture well, as they are principally developed only to describe the sequen- 
tial behaviour. Previous tools have either hand-coded dependency information, 
which is acceptable for cut-down ISA models but too laborious and error-prone 
at the scale of the ISA models we use, or used a heavyweight taint-tracking 
interpreter [15]. Our approach avoids both of these. It is similar to the latter, 
computing dependencies from the ISA specification, but building the footprint 
analysis atop our symbolic execution library requires only around 500 LoC. 

To express dependencies, we need to associate each event in our candidate 
executions with the syntactic instruction/opcode that generated them. To do 
this we use a Sail function __instr_announce(opcode), called in each architecture’s
fetch-decode-execute loop just after fetching an instruction; this adds a special 
effect to the candidate execution recording the instruction opcode. We also have 
another special effect that delimits each fetch-decode-execute cycle, so each effect 
such as read-mem and write-mem that would give rise to an event can be associated 
with an opcode, as well as an index in the program order relation for its thread. 

For each instruction we also need to know its footprint: data about the 
instruction including which input registers it reads, which output registers it 
writes, whether it is a branch instruction, and so on. It also contains taint 
information: we need to know which register writes may contain data ‘tainted’
by a memory read performed by a load, or which input registers ‘taint’ data 
written to memory. The Sail ISA specifications do not explicitly describe this 
footprint, so we are forced to derive it from the specification. 

To do this we symbolically evaluate each opcode independently in a suitably 
unconstrained environment so as to capture all its possible behaviours. This 
can be computationally expensive due to the number of possible behaviours 
some instructions have, so we build a footprint cache to avoid re-computing this
where possible. It turns out to be hard to distinguish ordinary branches from 
instructions that can cause an exception to occur, so we add a special branch 
address announce effect, created by a Sail function __branch_address_announce 
that we call in branch instructions. This also enables the taint tracking for branch 
addresses we need for control dependencies as described above. The taint tracking 
is achieved simply by looking at what sub-expressions in the generated SMT 
problem contain variables that also appear in the various effects in each trace. 
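
The variable-sharing idea can be sketched on the simplified add trace from Sect. 2.1. The encoding below (effects as tuples, expressions as nested tuples, variables as strings) is a hypothetical stand-in for the real trace format, and register_taint is an illustrative name rather than a function of the tool.

```python
def free_vars(expr, defs):
    """Variables an expression depends on, expanding define-const names."""
    if isinstance(expr, str):
        if expr in defs:
            return free_vars(defs[expr], defs)
        return {expr}
    if isinstance(expr, tuple):          # (operator, arg1, arg2, ...)
        found = set()
        for arg in expr[1:]:
            found |= free_vars(arg, defs)
        return found
    return set()                         # literal constant


def register_taint(trace):
    """Map each written register to the set of read registers it depends on."""
    defs = {}        # defined name -> defining expression
    read_from = {}   # SMT variable -> register it was read from
    taint = {}
    for step in trace:
        op = step[0]
        if op == "define-const":
            defs[step[1]] = step[2]
        elif op == "read-reg":
            read_from[step[2]] = step[1]
        elif op == "write-reg":
            used = free_vars(step[2], defs)
            taint[step[1]] = {read_from[v] for v in used if v in read_from}
    return taint


trace = [
    ("declare-const", "input"),
    ("read-reg", "R3", "input"),
    ("define-const", "output", ("bvadd", "input", 1)),
    ("write-reg", "R4", "output"),
]
print(register_taint(trace))
```

No separate taint-tracking interpreter is needed here, because the dependency information is read off the expressions the symbolic execution has already produced.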

Once we have this footprint information we can analyse it for the opcodes 
between each read and write effect and derive the necessary dependency relations 
over their events. Note that this dependency relation must be exact. If we under- 
approximate, we will allow executions that should be forbidden, and if we over- 
approximate we will forbid executions that should be allowed. 

In some cases the current Arm-provided ISA specification does not include 
enough information to identify the architecturally respected dependencies, and 
our dependency analysis would identify a dependency when there should not be 
one. To solve this we add some special Sail functions that give fine-grained control 
of the dependency calculation. For example, in indirect branches we ignore any 
dependency between the target register Xn and the link register X30 by including 
a function in the Sail definition that tells the footprint analysis to ignore any 
relation it finds between the two registers. 


    if branch_type == BranchType_INDCALL then {
      ignore_dependency_edge(n, 30);
      X(30) = PC() + 4
    };

This works by adding a special annotation in the candidate execution trace 
which can be used by the footprint analysis—for all other purposes it is a no- 
op. This information should properly become part of the architecture specifica- 
tion, as mistakes in the dependency calculations could be a source of soundness 
bugs. The lack of support for this information in existing ISA specifications can 
partly be explained by the lack of tooling to properly explore the integration of 
ISA specifications with concurrency, something we hope a tool such as ours can 
address. 


2.4 Web Interface 


Figure 2 shows the web interface we have developed for our tool, based on the 
web interface for the C memory model tool Cerberus-BMC by Lau et al. [16]. 
This can either be run locally, or via a website, https://isla-axiomatic.cl.cam.ac.uk.


3 System Litmus Tests 


As mentioned previously, one advantage of our tool is that, because it supports 
the full sequential ISA, it enables easy experimentation with tests and models 

[Figure 2: screenshot of the web interface, with panes for the litmus file (the MP test for aarch64 in TOML format), the memory model (in Cat), and the resulting execution graph; the result shown is "Allowed: 1 satisfiable solutions out of 1 candidates".]

Fig. 2. Web interface for the tool


outside the scope of previous tools, e.g. involving new systems features. For 
example, Simner et al. developed semantics for Arm instruction fetch and I/D 
cache maintenance [27]. Consider the litmus test in Fig. 3 [27, §3.3], a simple
test involving self-modifying code. In order to run this test and the others in [27] 
our tool required only minimal changes: we had to add support for data-cache 
and instruction-cache maintenance events and relations for them in our Cat 
to SMT translation. Additionally we needed to generalise how we generated 
the rf (reads-from) relation to generate both the regular rf relation and the 
new irf (instruction-reads-from) relation. Because our tool already runs tests 
using a fetch-decode-execute loop, all the instruction fetch events were already
available—we in fact filter them out when running user-mode tests. 

When generating candidate executions for a thread we normally do not 
assume anything about what other threads may be doing, but for self-modifying 
code this would clearly be problematic, as it would imply that any other thread 
could modify any of this thread’s instructions arbitrarily. We therefore mark the 
memory locations that contain instructions that can be modified and provide in 
advance all the possible values they might take. 


4 Results and Comparisons 


We evaluate our tool for correctness and performance with respect to Herd using 
previous corpora of tests. 

     1        str w0, [x1]
     2        dc cvau, x1
     3        dsb ish
     4        ic ivau, x1
     5        dsb ish
     6        isb
     7        bl f
     8        mov w2, w10
     9        b Lout
    10  f:    b l0
    11  l1:   mov w10, #2
    12        ret
    13  l0:   mov w10, #1
    14        ret
    15  Lout:

In the initial state register x1 contains the address of the label f, and
register w0 contains the opcode for the branch instruction b l1. Without the
highlighted cache-maintenance and barrier instructions on lines 2-6, the write
of that opcode to f performed by the store on line 1 may or may not be observed
before the instruction fetch for f, so at the end of the test the register w2
can contain either 1 or 2, depending on whether we branched to l1 or l0.

The highlighted instructions on lines 2-6 are a sequence of data-cache (dc) and
instruction-cache (ic) maintenance instructions with requisite data and
instruction barriers that must occur to guarantee that the write is observed by
the instruction fetch, as documented by the Armv8-A architecture reference
manual [7] and captured by the axiomatic model of Simner et al. [27].

Fig. 3. Self-modifying code litmus test SM+cachesync-isb


We select 3798 litmus tests for both Armv8-A and RISC-V to compare 
between our tool and Herd—these tests include a representative set of features 
such as barriers and atomics, while exercising all of the basic litmus test shapes. 
All tests were run on a 2.6GHz Intel Xeon Gold 6240 CPU with 36 physical 
cores and 400GB of RAM. The tests are split into rough categories based on 
the contents of the tests. We ran 36 concurrent instances of both our tool and 
Herd across each set of tests, running Herd with the -speedcheck fast flag which 
causes it to stop enumerating executions when it resolves the final assertion in 
each test, which is the closest behaviour to how our tool behaves by default. 

To assess correctness, we use a set of golden references for the above tests,
on all of which the previous operational RMEM [14] and axiomatic Herd models
and tools agree, and which have been extensively validated against hardware
implementations. We confirm that our tool produces the same expected results
as those models for all the litmus tests, including when run in exhaustive mode.

To assess performance, the table below gives the total real execution time for 
each batch of tests. 


Test set                 | Number of tests | Isla       | Herd
Armv8-A basic 2-thread   | 1377            | 49 s       | 11 s
Armv8-A basic 3-thread   | 161             | 11.7 s     | 1.2 s
Armv8-A exclusives       | 23              | 20.2 s     | 1.5 s
Armv8-A DMB/LD           | 70              | 7.45 s     | 0.7 s
Armv8-A PPO              | 2020            | 3 m 29.3 s | 16.2 s
RISC-V basic 2-thread    | 36              | 0.7 s      | 0.2 s
RISC-V AMOs              | 111             | 2 s        | 0.7 s

In general Herd is faster for nearly all tests, but this is not surprising given
the amount of detail in the full-scale instruction semantics that we are using,
particularly for Armv8-A. Our goal is not to be faster, but to support those
full-scale ISA semantics while remaining fast enough for practical purposes. We
achieve this: most tests take only a second or so to run, which is perfectly usable
interactively. For example, in a single sequential run of the Armv8-A basic
3-thread tests, the shortest test took 872 ms to run, while the longest
took 1231 ms. The above batch times are similarly usable for (e.g.)
regression testing while editing a model.

We also evaluate our tool with respect to that of Simner et al., for the
instruction-fetch tests (which are currently not supported by Herd) in Sect. 6
of their paper. Our tool returns the expected results for all these tests,
including the two tests (FOW and SM.F+ic) that were unsupported by their tool.
In terms of performance, we note that their tool took 30 min to run just 90 of
the 1377 basic 2-thread tests above, which is awkwardly slow for using a tool in
practice, whereas when limiting our tool to 8 cores (to more closely match their
experimental setup) our tool executes all 1377 in under 3 min. We were
additionally able to provide further validation that the Simner et al. model
agrees with the standard Armv8-A model on non-self-modifying tests by showing
that it behaves identically for all 3798 of the non-self-modifying tests above.

Abstract. Some of the most significant high-level properties of currencies
are the sums of certain account balances. Properties of such sums can
ensure the integrity of currencies and transactions. For example, the sum 
of balances should not be changed by a transfer operation. Currencies 
manipulated by code present a verification challenge to mathematically 
prove their integrity by reasoning about computer programs that operate 
over them, e.g., in Solidity. The ability to reason about sums is essential: 
even the simplest ERC-20 token standard of the Ethereum community 
provides a way to access the total supply of balances. 

Unfortunately, reasoning about code written against this interface is 
non-trivial: the number of addresses is unbounded, and establishing global 
invariants like the preservation of the sum of the balances by operations like 
transfer requires higher-order reasoning. In particular, automated reason- 
ers do not provide ways to specify summations of arbitrary length. 

In this paper, we present a generalization of first-order logic which 
can express the unbounded sum of balances. We prove the decidability
of one of our extensions and the undecidability of a slightly richer one. 
We introduce first-order encodings to automate reasoning over software 
transitions with summations. We demonstrate the applicability of our 
results by using SMT solvers and first-order provers for validating the 
correctness of common transitions in smart contracts. 


1 Introduction 


A basic challenge in smart contract verification is how to express the functional 
correctness of transactions, such as currency minting or transferring between 
accounts. Typically, the correctness of such a transaction can be verified by prov- 
ing that the transaction leaves the sum of certain account balances unchanged. 

Consider for example the task of minting an unbounded number of tokens in 
the simplified ERC-20 token standard of the Ethereum community [32], as
illustrated in Fig. 1¹. This example deposits the minted amount (n) into the receiver’s
address (a) and we need to ensure that the mint operation only changed the bal- 


1 The old- prefix denotes the value of a function before the mint transition, and the 
new- prefix denotes the value afterwards. 
© The Author(s) 2021

    a: Address
    n: Nat

    mint(a, n)

    # Post-conditions

    assert new-bal(a) = old-bal(a) + n       # (i)

    for each Address a’ ≠ a:                 # (ii)
        assert new-bal(a’) = old-bal(a’)

    assert new-sum() = old-sum() + n         # (iii)


Fig. 1. Minting n tokens in ERC-20.


ance of the receiver. To do so, in addition to (i) proving that the balance of the 
receiver has been increased by n, we also need to verify that (ii) the account 
balance of every user address a’ different from a has not been changed during
the mint operation and that (iii) the sum of all balances changed exactly by
the amount that was minted. The validity of these three requirements (i)-(iii),
formulated as the post-conditions of Fig. 1, implies the functional correctness
of the mint operation.
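
The three post-conditions can be checked directly on a small, finite set of accounts. The following Python sketch is only an illustration under that assumption: the dictionary-based balance state and the helper names `mint` and `check_mint_postconditions` are hypothetical and not part of ERC-20 or of this paper's encodings, and a finite test of this kind does not replace the unbounded reasoning discussed here.

```python
from typing import Dict


def mint(old_bal: Dict[str, int], a: str, n: int) -> Dict[str, int]:
    """Return the balances after depositing n freshly minted tokens into a."""
    new_bal = dict(old_bal)
    new_bal[a] = new_bal.get(a, 0) + n
    return new_bal


def check_mint_postconditions(old_bal: Dict[str, int], new_bal: Dict[str, int],
                              a: str, n: int) -> bool:
    """Check post-conditions (i)-(iii) of Fig. 1 on finite balance maps."""
    addresses = set(old_bal) | set(new_bal)
    # (i) the receiver's balance grew by exactly n
    cond_i = new_bal.get(a, 0) == old_bal.get(a, 0) + n
    # (ii) every other address is unchanged
    cond_ii = all(new_bal.get(x, 0) == old_bal.get(x, 0)
                  for x in addresses if x != a)
    # (iii) the sum of all balances grew by exactly n
    cond_iii = sum(new_bal.values()) == sum(old_bal.values()) + n
    return cond_i and cond_ii and cond_iii


old = {"alice": 5, "bob": 7}
new = mint(old, "alice", 3)
ok = check_mint_postconditions(old, new, "alice", 3)
# A faulty transition that also debits another account violates (ii) and (iii).
faulty = check_mint_postconditions(old, {"alice": 8, "bob": 6}, "alice", 3)
```

On a finite state, (iii) follows from (i) and (ii) by simple arithmetic; the difficulty addressed in the paper is establishing the same link when the number of addresses is unbounded.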

Surprisingly, proving formulas similar to the post-conditions of Fig. 1 is chal- 
lenging for state-of-the-art automated reasoners, such as SMT solvers [6,7,9] and 
first-order provers [11,19,34]: it requires reasoning that links local changes of the 
receiver (a) with a global state capturing the sum of all balances, as well as con- 
structing that global state as an aggregate of an unbounded but finite number 
of Address balances. Moreover, our encoding of the problem uses discrete coins 
that are minted and deposited, whose number is unbounded but finite as well. 

In this paper we address verification challenges of software transactions with 
aggregate properties, such as preservation of sums by transitions that manipulate 
low-level, individual entities. Such properties are best expressed in higher-order 
logic, hindering the use of existing automated reasoners for proving them. To 
overcome such a reasoning limitation, we introduce Sum Logic (SL) as a gen- 
eralization of first-order logic, in particular of Presburger arithmetic. Previous 
works [12,21,31] have also introduced extensions of first-order logic with aggre- 
gates by counting quantifiers or generalized quantifiers. In Sum Logic (SL) we 
only consider the special case of integer sums over uninterpreted functions, allow- 
ing us to formalize SL properties with and about unbounded sums, in particular 
sums of account balances, without higher-order operations (Sect. 3). We prove 
the decidability of one of our SL extensions and the undecidability of a slightly 
richer one (Sect. 4). Given previous results [21], our undecidability result is not 
surprising. In contrast, what may be unexpected is our decidability result and 
the fact that we can use our first-order fragment for a convenient and practical 
new way to verify the correctness of smart contracts. 

We further introduce first-order encodings which enable automated reason- 
ing over software transactions with summations in SL (Sect.5). Unlike [5], 
where SMT-specific extensions supporting higher-order reasoning have been 
introduced, the logical encodings we propose allow one to use existing
reasoners without any modification. We are not restricted to SMT reasoning, but can
also leverage generic automated reasoners, such as first-order theorem provers,
supporting first-order logic. We believe our results ease applying automated rea- 
soning to smart contract verification even for non-experts. 

We demonstrate the practical applicability of our results by using SMT 
solvers and first-order provers for validating the correctness of common financial 
transitions appearing in smart contracts (Sect.6). We refer to these transitions 
as smart transitions. We encode SL into pure first-order logic by adding another 
sort that represents the tokens of the crypto-currency themselves (which we dub 
“coins” ). 

Although the encodings of Sect. 5 do not translate to our decidable SL
fragment from Sect. 4, our experimental results show that automated reasoning
engines can handle them consistently and fast. The decidability results of
Sect. 4 set the boundaries for what one can expect to achieve, while our
experiments from Sect. 6 demonstrate that the unknown middle-ground can still be
automated.

While our work is mainly motivated by smart contract verification, our results 
can be used for arbitrary software transactions implementing sum/aggregate 
properties. Further, when compared to the smart contract verification frame- 
work of [33], we note that we are not restricted to proving the correctness of 
smart contracts as finite-state machines, but can deal with semantic properties 
expressing financial transactions in smart contracts, such as currency
minting/transfers.

While ghost variable approaches [14] can reason about changes to the global 
state (the sum), our approach allows the verifier to specify only the local changes 
and automatically prove the impact on the global state. 


Contributions. In summary, this paper makes the following contributions: 


— We present a generalization to Presburger arithmetic (SL, in Sect.3) that 
allows expressing properties about summations. We show how we can formal- 
ize verification problems of smart contracts in SL. 

— We discuss the decidability problem of checking validity of SL formulas 
(Sect. 4): we prove that it is undecidable in the general case, but also that 
there exists a small decidable fragment. 

— We show different encodings of SL to first-order logic (Sect. 5). To this end, 
we consider theory-specific reasoning and variations of SL, for example by 
replacing non-negative integer reasoning with term algebra properties. 

— We evaluate our results with SMT solvers and first-order theorem provers, 
by using 31 new benchmarks encoding smart transitions and their proper- 
ties (Sect.6). Our experiments demonstrate the applicability of our results 
within automated reasoning, in a fully automated manner, without any user 
guidance. 


2 Preliminaries 


We consider many-sorted first-order logic (FOL) with equality, defined in the 
standard way. The equality symbol is denoted by ~.

We denote by $\mathrm{STRUCT}[\Sigma]$ the set of all structures for the vocabulary $\Sigma$. A
structure $\mathcal{A} \in \mathrm{STRUCT}[\Sigma]$ is a pair $(\mathcal{D}, \mathcal{I})$, where for each sort $s$, its domain
in $\mathcal{A}$ is $\mathcal{D}(s)$, and for each symbol $S$, its interpretation in $\mathcal{A}$ is $\mathcal{I}(S)$. Note that
models of a formula $\varphi$ over a vocabulary $\Sigma$ are structures $\mathcal{A} \in \mathrm{STRUCT}[\Sigma]$.

A first-order theory is a set of first-order formulas closed under logical
consequence. We will consider the first-order theory of the natural numbers with
addition. This is Presburger arithmetic (PA), which is decidable [27].
We write $\mathbb{N}$ to denote the set of natural numbers. We consider $0 \in \mathbb{N}$ and
write $\mathbb{N}^+$ to explicitly exclude 0 from $\mathbb{N}$. The vocabulary of PA is
$\Sigma_{\text{Presburger}} = (0, 1, c_1, \ldots, c_l, +^2)$, with all constants $0, 1, c_i$ of sort Nat.
A structure $\mathcal{A} = (\mathcal{D}, \mathcal{I}) \in \mathrm{STRUCT}[\Sigma_{\text{Presburger}}]$ is called a Standard Model of
Arithmetic when $\mathcal{D}(\text{Nat}) = \mathbb{N}$ and $+^2$ is interpreted as the standard binary
addition function $+$ over the naturals. The vocabulary $\Sigma_{\text{Presburger}}$ can be extended
with a total order relation, yielding $\Sigma^{\le}_{\text{Presburger}} = (0, 1, +^2, \le^2)$, where $\le^2$ is
interpreted as the binary relation $\le$ in Standard Models of Arithmetic.


3 Sum Logic (SL) 


We now define Sum Logic (SL) as a generalization of Presburger arithmetic, 
extending Presburger arithmetic with unbounded sums. SL is motivated by 
applications of financial transactions over cryptocurrencies in smart contracts. 
Smart contracts are decentralized computer programs executed on a blockchain- 
based system, as explained in [28]. Among other tasks, they automate financial 
transactions such as transferring and minting money. We refer to these trans- 
actions as smart transitions. The aim of this paper and SL in particular is to 
express and reason about the post-conditions of smart transitions similar to 
Fig. 1. 

SL expresses smart transition relations among sums of accounts of various
kinds, e.g., at different banks, times, etc. Each such kind, $j$, is modeled by an
uninterpreted function symbol, $b_j$, where $b_j(a)$ denotes the balance of a’s account
of kind $j$, and a constant symbol $s_j$, which denotes the sum of all outputs of
$b_j$. As such, our SL generalizes Presburger arithmetic with (i) a sort Address
corresponding to the (unbounded) set of account addresses; (ii) balance functions
$b_j$ mapping account addresses from Address to account values of sort Nat; and
(iii) sum constants $s_j$ of sort Nat capturing the total sum of all account balances
represented by $b_j$. Formally, the vocabulary of SL is defined as follows.


Definition 1 (SL Vocabulary). Let

    $$\Sigma^{l,m,d}_{+,\le} = (a_1, \ldots, a_l,\ b^1_1, \ldots, b^1_m,\ c_1, \ldots, c_d,\ s_1, \ldots, s_m,\ 0, 1, +^2, \le^2)$$

be a sorted first-order vocabulary of SL over sorts {Address, Nat}, where

- (Addresses) The constants $a_1, \ldots, a_l$ are of sort Address;
- (Balance functions) $b^1_1, \ldots, b^1_m$ are unary function symbols from Address to
  Nat;
- (Constants and Sums) The constants $c_1, \ldots, c_d$, $s_1, \ldots, s_m$ and $0, 1$ are of sort
  Nat;
- $+^2$ is a binary function Nat × Nat → Nat;
- $\le^2$ is a binary relation over Nat × Nat.


Table 1. ERC-20 token standard

Function              | Encoding in SL                                       | Reference in ERC-20
sum                   | $s$ or $s'$                                          | totalSupply
bal(a)                | $b(a)$ or $b'(a)$                                    | balanceOf
mint(a, v)            | $b'(a) \sim b(a) + v$                                | transfer
transferFrom(f, t, v) | $b'(t) \sim b(t) + v \wedge b(f) \sim b'(f) + v$     | transferFrom

In what follows, when the cardinalities in an SL vocabulary are clear from
context, we simply write $\Sigma$ instead of $\Sigma^{l,m,d}_{+,\le}$. Further, by $\Sigma_{\cancel{+},\cancel{\le}}$ we denote the
sub-vocabulary where the crossed-out symbols are not available. Note that even
when addition is not available, we still allow writing numerals larger than 1.

We restrict ourselves to universal sentences over an SL vocabulary, with
quantification only over the Address sort.

We now extend the Tarskian semantics of first-order logic to ensure that the
sum constants of an SL vocabulary ($s_1, \ldots, s_m$) are equal to the sum of outputs
of their associated balance functions ($b_j$ for each $s_j$) over the respective entire
domains of sort Address.

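Let $\Sigma$ be an SL vocabulary. An SL structure $\mathcal{A} = (\mathcal{D}, \mathcal{I}) \in \mathrm{STRUCT}[\Sigma]$
representing a model for an SL formula $\varphi$ is called an SL model iff

    $$\mathcal{I}(s_j) = \sum_{a \in \mathcal{D}(\text{Address})} [\mathcal{I}(b_j)](a), \quad \text{for each } 1 \le j \le m. \qquad \text{(Sum Property)}$$

We write $\mathcal{A} \models_{SL} \varphi$ to mean that $\mathcal{A}$ is an SL model of $\varphi$. When it is clear
from context, we simply write $\mathcal{A} \models \varphi$.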
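
As an informal illustration of the Sum Property, the following Python sketch checks it on a finite structure. The representation is an assumption made only for this example: addresses are strings, `balances[j][a]` stands for $[\mathcal{I}(b_j)](a)$, and `sums[j]` stands for $\mathcal{I}(s_j)$; the function name `satisfies_sum_property` is hypothetical.

```python
from typing import Dict, Iterable


def satisfies_sum_property(addresses: Iterable[str],
                           balances: Dict[str, Dict[str, int]],
                           sums: Dict[str, int]) -> bool:
    """True iff every sum constant equals the sum of its balance function
    over the whole (finite) Address domain."""
    domain = list(addresses)
    return all(sums[j] == sum(balances[j][a] for a in domain)
               for j in balances)


# Two balance functions b and b' over a three-element Address domain.
addresses = ["a1", "a2", "a3"]
balances = {
    "b":  {"a1": 5, "a2": 0, "a3": 2},
    "b'": {"a1": 8, "a2": 0, "a3": 2},
}
sums = {"b": 7, "b'": 10}
is_sl_model = satisfies_sum_property(addresses, balances, sums)
```

A structure whose sum constants disagree with the balances, for example with `sums["b'"] = 9`, is an ordinary first-order structure but not an SL model.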


Example 1 (Encoding ERC-20 in SL). As a use case of SL, we showcase the
encoding of the ERC-20 token standard of the Ethereum community [32] in SL.
To this end, we consider an SL vocabulary $\Sigma^{l,2,d}_{+,\le}$. We respectively denote the
balance functions and their associated sums as $b, b', s, s'$ in the SL structure
over $\Sigma^{l,2,d}_{+,\le}$. The resulting instance of SL can then be used to encode ERC-20
operations/smart transitions as SL formulas, as shown in Table 1. Using this
encoding, the post-condition of Fig. 1 is expressed as the SL formula

    $$b'(a) \sim b(a) + n \ \wedge\ \forall a' \not\sim a.\ b'(a') \sim b(a') \ \wedge\ s' \sim s + n \qquad (1)$$

formalizing the correctness of the smart transition of minting n tokens in Fig. 1.
In the applied verification examples in Sect. 6, rather than verifying the low-level
implementation of built-in functions such as mint, we assume their correctness
by including suitable axioms.

4 Decidability of SL


We consider the decidability problem of verifying formulas in SL. We show that
when there are several function symbols $b_j$ to sum over, the satisfiability
problem for SL becomes undecidable². We first present, however, a useful decidable
fragment of SL.


4.1 A Decidable Fragment of SL 


We prove decidability for a fragment of SL, which we call the $(l,1,d)$-FRAG
fragment of SL (Theorem 4). For doing so, we reduce the fragment to Presburger
arithmetic, by using regular Presburger constructs to encode SL extensions, that
is the uninterpreted functions and sum constants of SL.

The first step of our reduction proof is to consider distinct models, which
are models where the Address constants $a_i$ represent distinct elements in the
domain $\mathcal{D}(\text{Address})$. While this restriction is somewhat unnatural, we show that
for each vocabulary and formula that has a model, there exists an equisatisfiable
formula over a different vocabulary that has a distinct model (Theorem 1). The
crux of our decidability proof is then proving that $(l,1,d)$-FRAG has small
Address space: given a formula $\varphi$, if it is satisfiable, then there exists a model
where $|\mathcal{D}(\text{Address})| \le \kappa(|\varphi|)$, where $|\varphi|$ is the length of $\varphi$ and $\kappa(\cdot)$ is some
computable function (Theorem 2)³.


Distinct Models. An SL structure $\mathcal{A}$ is considered distinct when the $l$ Address
constants represent $l$ distinct elements in $\mathcal{D}(\text{Address})$, i.e.,

    $$|\{\mathcal{I}(a_1), \ldots, \mathcal{I}(a_l)\}| = l.$$

Since each SL model induces an equivalence relation over the Address constants,
we consider partitions $P$ over $\{a_1, \ldots, a_l\}$. For each possible partition $P$ we define
a transformation of terms and formulas $T_P$ that substitutes equivalent Address
constants with a single Address constant. The resulting formulas are defined
over a vocabulary that has $|P|$ Address constants. We show that given an SL
formula $\varphi$, if $\varphi$ has a model, we can always find a partition $P$ such that each of
its classes corresponds to an equivalence class induced by that model.


Theorem 1 (Distinct Models). Let $\varphi$ be an SL formula over $\Sigma$, then $\varphi$ has a
model iff there exists a partition $P$ of $\{a_1, \ldots, a_l\}$ such that $T_P(\varphi)$ has a distinct
model.
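
To illustrate the partitions that Theorem 1 ranges over, the following Python sketch enumerates all partitions of a list of Address constants and builds, for one partition, a map sending each constant to a representative of its class. This is only a sketch: the formal definition of $T_P$ is given in [10], and choosing the first constant of each class as its representative is an assumption made here; the names `partitions` and `representative_map` are hypothetical.

```python
from typing import Dict, Iterator, List


def partitions(items: List[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Either add `first` to one of the existing blocks ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + part


def representative_map(partition: List[List[str]]) -> Dict[str, str]:
    """Map each constant to the first constant of its block."""
    return {const: block[0] for block in partition for const in block}


all_parts = list(partitions(["a1", "a2", "a3"]))
subst = representative_map([["a1", "a3"], ["a2"]])
```

For three constants there are five partitions; under the partition {{a1, a3}, {a2}} every occurrence of a3 would be replaced by a1, leaving a vocabulary with two Address constants.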


Small Address Space. In order to construct a reduction to Presburger arith- 
metic, we bound the size of the Address sort. For a fragment of SL to be 
decidable, we therefore need a way to bound its models upfront. We formalize 
this requirement as follows. 


² Proofs of our results are given in the appendix of [10].
³ The function $\kappa(\cdot)$ is defined per decidable fragment of SL, and not per formula.

Definition 2 (Small Address Space). Let FRAG be some fragment of SL
over vocabulary $\Sigma = \Sigma^{l,m,d}_{+,\le}$. FRAG is said to have small Address space if there
exists a computable function $\kappa_\Sigma(\cdot)$, such that for any SL formula $\varphi \in$ FRAG,
$\varphi$ has a distinct model iff $\varphi$ has a distinct model $\mathcal{A} = (\mathcal{D}, \mathcal{I})$ with small Address
space, where $|\mathcal{D}(\text{Address})| \le \kappa_\Sigma(|\varphi|)$.

We call $\kappa_\Sigma(\cdot)$ the bound function of FRAG; when the vocabulary is clear
from context we simply write $\kappa(\cdot)$.


One instance of a fragment (or rather, family of fragments) that satisfies this
property is the $(l,1,d)$-FRAG fragment: the simple case of a single uninterpreted
“balance” function (and its associated sum constant), further restricted
by removing the binary function $+$ and the binary relation $\le$. Therefore, we
derive the following theorem:


Theorem 2 (Small Address Space of $(l,1,d)$-FRAG).
For any $l$, $d$, it holds that $(l,1,d)$-FRAG, the fragment of SL formulas over the
SL vocabulary

    $$\Sigma^{l,1,d}_{\cancel{+},\cancel{\le}} = (a_1, \ldots, a_l,\ b^1,\ c_1, \ldots, c_d,\ s,\ 0, 1),$$

has small Address space with bound function $\kappa(x) = l + x + 1$.


An attempt to trivially extend Theorem 2 for a fragment of SL with two 
balance functions falls apart in a few places, but most importantly when com- 
paring balances to the sum of a different balance function. In Sect. 4.2 we show 
that these comparisons are essential for proving our undecidability result in SL. 


Presburger Reduction. For showing decidability of some FRAG fragment of
SL, we describe a Turing reduction to pure Presburger arithmetic. We introduce
a transformation $\tau(\cdot)$ of formulas in SL into formulas in Presburger arithmetic.
It maps universal quantifiers to disjunctions, and sums to explicit addition of all
balances. In addition, we define an auxiliary formula $\eta(\varphi)$, which ensures only
valid addresses are considered, and that invalid addresses have zero balances.
The formal definitions of $\tau(\cdot)$ and $\eta(\varphi)$ can be found in [10].

By relying on the properties of distinctness and small Address space we get
the following results.


Theorem 3 (Presburger Reduction). An SL formula $\varphi$ has a distinct
SL model with small Address space iff $\tau(\varphi) \wedge \eta(\varphi)$ has a Standard Model of
Arithmetic.


Theorem 4 (SL Decidability). Let FRAG be a fragment of SL that has 
small Address space, as defined in Definition 2. Then, FRAG is decidable. 


Proof (Theorem 4). Let $\varphi$ be a formula in FRAG. Then $\varphi$ has an SL model iff
for some partition $P$ of $\{a_1, \ldots, a_l\}$, $T_P(\varphi)$ has a distinct SL model. For any $P$,
the formula $T_P(\varphi)$ is in FRAG, therefore $T_P(\varphi)$ has a distinct SL model iff it
has a distinct SL model with small Address space.


324 N. Elad et al. 


From Theorem 3, we get that for any $P$, $\varphi_P = T_P(\varphi)$ has a distinct SL model
iff $\tau(\varphi_P) \wedge \eta(\varphi_P)$ has a Standard Model of Arithmetic. By using the PA decision
procedure as an oracle, we obtain the following decision procedure for a FRAG
formula $\varphi$:

- For each possible partition $P$ of $\{a_1, \ldots, a_l\}$, let $\varphi_P = T_P(\varphi)$;
- Using a PA decision procedure, check whether $\tau(\varphi_P) \wedge \eta(\varphi_P)$ has a model,
  for each $P$;
- If a model for some partition $P$ was found, the formula $\varphi_P$ has a distinct SL
  model, and therefore $\varphi$ has an SL model;
- Otherwise, there is no distinct SL model for any partition $P$, and therefore
  there is no SL model for $\varphi$.


Remark 1. Our decision procedure for Theorem 4 requires $B_l$ Presburger queries,
where $B_l$ is the Bell number counting all possible partitions of a set of size $l$.
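
To give a feel for how the number of Presburger queries grows with the number $l$ of Address constants, the following Python sketch computes Bell numbers with the standard Bell-triangle recurrence. The function name `bell_number` is chosen here for illustration only.

```python
def bell_number(l: int) -> int:
    """Number of partitions of an l-element set, via the Bell triangle."""
    row = [1]  # first row of the triangle; row[0] is B_0
    for _ in range(l):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to its left neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


query_counts = [bell_number(l) for l in range(6)]
```

The first values are 1, 1, 2, 5, 15, 52, so the procedure issues 5 queries for three Address constants and 52 for five.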


Using Theorem 4 and Theorem 2, we then obtain the following result. 


Corollary 1. $(l,1,d)$-FRAG is decidable.


4.2 SL Undecidability 


We now show that simple extensions of our decidable $(l,1,d)$-FRAG fragment
lose decidability (Theorem 5). For doing so, we encode the halting problem
of a two-counter machine using SL with 3 balance functions, thereby proving
that the resulting SL fragment is undecidable.

Consider a two-counter machine, whose transitions are encoded by the
Presburger formula $\pi(c_1, c_2, p, c_1', c_2', p')$ with 6 free variables: 2 for each of the three
registers, one of which being the program counter (PC). We assume w.l.o.g. that
all three registers are within $\mathbb{N}^+$, allowing us to use addresses with a zero balance
as a special “separator”. In addition, we assume that the program counter is 1
at the start of the execution, and that there exists a single halting statement at
line $H$. That is, the two-counter machine halts iff the PC is equal to $H$.


Reduction Setting. We have 4 Address elements for each time-step, 3 of
them hold one register each, and one is used to separate between each group of
Address elements (see Table 2). We have 3 uninterpreted functions from Address
to Nat (“balances”). For readability we denote these functions as $c, l, g$ (instead
of $b_1, b_2, b_3$) and their respective sums as $s_c, s_l, s_g$:


1. Function $c$: Cardinality function, used to force size constraints. We set its
   value for all addresses to be 1, and therefore the number of addresses is $s_c$.

2. Function $l$: Labeling function, to order the time-steps. We choose one element
   to have a maximal value of $s_c - 1$ and ensure that $l$ is injective. This means
   that the values of $l$ are exactly the distinct values in $[0, s_c - 1]$.

3. Function $g$: General purpose function, which holds either one of the registers
   or 0 to mark the Address element as a separating one.

Table 2. Transition system of a 2-counter machine, array view.

Group              | Address | l(Address) | c(Address) | g(Address)
Time-step #0       |         | 0          | 1          | 0
                   |         | 1          | 1          | c_1 at #0
                   |         | 2          | 1          | c_2 at #0
                   | a_0     | 3          | 1          | PC at #0 = 1
...
Time-step #i       | x_1     | 4i         | 1          | 0
                   | x_2     | 4i+1       | 1          | c_1 at #i
                   | x_3     | 4i+2       | 1          | c_2 at #i
                   | x_4     | 4i+3       | 1          | PC at #i
Time-step #(i+1)   | x_5     | 4i+4       | 1          | 0
                   | x_6     | 4i+5       | 1          | c_1 at #(i+1)
                   | x_7     | 4i+6       | 1          | c_2 at #(i+1)
                   | x_8     | 4i+7       | 1          | PC at #(i+1)
...
Time-step #n       |         | s_c-4      | 1          | 0
                   |         | s_c-3      | 1          | c_1 at #n
                   |         | s_c-2      | 1          | c_2 at #n
                   | a_1     | s_c-1      | 1          | PC at #n = H


Each group representing a time-step consists of 4 Address elements, ordered as follows:
1. First, a separating Address element $x$ (where $g(x)$ is 0).

2. Then, the two general-purpose counters.

3. Lastly, the program counter.


In addition we have 2 Address constants, $a_0$ and $a_1$, which represent the PC
value at the start and at the end of the execution. The element $a_1$ also holds
the maximal value of $l$, that is, $l(a_1) + 1 \sim s_c$. Further, $a_0$ holds the
fourth-minimal value, since it is the last element of the first group, and each group has
four elements.


Formalization Using a Two-Counter Machine. We now formalize our
reduction, proving undecidability of SL.
(i) We impose an injective labeling

    $$\varphi_1 = \forall x, y.\ (l(x) \sim l(y)) \to (x \sim y)$$

(ii) We next formalize properties over the program counter PC. The Address
constant that represents the program counter PC value of the last time-step is
set to have the maximal labeling, that is

    $$\varphi_2 = \forall x.\ l(x) \le l(a_1)$$

Further, the Address constant that represents the PC value of the first time-step
has the fourth labeling, hence

    $$\varphi_3 = l(a_0) \sim 3$$

Finally, the first and last values of the program counter are respectively 1 and
$H$, that is

    $$\varphi_4 = g(a_0) \sim 1 \wedge g(a_1) \sim H$$

(iii) We express cardinality constraints ensuring that there are as many Address
elements as the labeling of the last Address constant ($a_1$) plus 1. We assert

    $$\varphi_5 = (s_c \sim l(a_1) + 1) \wedge \forall x.\ (c(x) \sim 1)$$

(iv) We encode the transitions of the two-counter machine, as follows. For every
8 Address elements, if they represent two sequential time-steps, then the formula
for the transitions of the two-counter machine is valid for the registers it holds.
As such, we have

    $$\varphi_6 = \forall x_1, \ldots, x_8.\ (F_1 \wedge F_2 \wedge F_3) \to \pi(g(x_2), g(x_3), g(x_4), g(x_6), g(x_7), g(x_8))$$

where the conjunction $F_1 \wedge F_2 \wedge F_3$ expresses that $x_1, \ldots, x_8$ are two sequential
time-steps, with $F_1$, $F_2$ and $F_3$ defined as below. In particular, $F_1$, $F_2$ and
$F_3$ formalize that $x_1, \ldots, x_8$ have sequential labeling, starting with one
zero-valued Address element (“separator”) and continuing with 3 non-zero elements,
as follows:

- Sequential:

    $$l(x_2) \sim l(x_1) + 1 \wedge \cdots \wedge l(x_8) \sim l(x_7) + 1 \qquad (F_1)$$

- Time-steps:

    $$g(x_1) \sim 0 \wedge g(x_2) > 0 \wedge g(x_3) > 0 \wedge g(x_4) > 0, \qquad (F_2)$$
    $$g(x_5) \sim 0 \wedge g(x_6) > 0 \wedge g(x_7) > 0 \wedge g(x_8) > 0 \qquad (F_3)$$


Based on the above formalization, the formula $\varphi = \varphi_1 \wedge \cdots \wedge \varphi_6$ is satisfiable
iff the two-counter machine halts within a finite number of time-steps (and the
exact number is determined by $s_c$). Since the halting problem for two-counter
machines is undecidable, our SL, already with 3 uninterpreted functions and
their associated sums, is also undecidable.


Theorem 5. For any $l \ge 2$, $m \ge 3$ and $d$, any fragment of SL over $\Sigma^{l,m,d}_{+,\le}$ is
undecidable.
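
The array view of Table 2 can be made concrete on a toy machine. The following Python sketch lays out a finite trace as the three functions $l$, $c$, $g$ over integer addresses and checks the six constraints (i)-(iv) of the formalization on it. Everything machine-specific is an assumption made for illustration: the one-line machine `toy_pi` with halting line 2, the list-based representation, and the helper names are all hypothetical. The check of the transition constraint relies on the injectivity of $l$: once $x_1$ is fixed, the sequential-labeling condition determines $x_2, \ldots, x_8$, so it suffices to iterate over $x_1$ rather than over all 8-tuples.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int, int]  # (c1, c2, pc); all registers are >= 1


def encode_trace(trace: List[State]):
    """Lay a non-empty trace out with four addresses per time-step.

    Addresses are the integers 0..4*len(trace)-1 and l, c, g are lists
    indexed by address. Returns (l, c, g, a0, a1).
    """
    l, c, g = [], [], []
    for t, (c1, c2, pc) in enumerate(trace):
        for k, value in enumerate((0, c1, c2, pc)):  # separator, c1, c2, PC
            l.append(4 * t + k)
            c.append(1)
            g.append(value)
    return l, c, g, 3, len(l) - 1


def check_encoding(l, c, g, a0, a1,
                   pi: Callable[..., bool], halt_line: int) -> bool:
    addresses = range(len(l))
    s_c = sum(c)                                   # Sum Property for c
    by_label = {l[x]: x for x in addresses}
    phi1 = len(by_label) == len(l)                 # l is injective
    phi2 = all(l[x] <= l[a1] for x in addresses)   # a1 has the maximal label
    phi3 = l[a0] == 3
    phi4 = g[a0] == 1 and g[a1] == halt_line
    phi5 = s_c == l[a1] + 1 and all(c[x] == 1 for x in addresses)
    phi6 = True
    for x1 in addresses:
        labels = [l[x1] + k for k in range(8)]
        if any(lab not in by_label for lab in labels):
            continue                               # F1 cannot hold for this x1
        gs = [g[by_label[lab]] for lab in labels]
        f2 = gs[0] == 0 and all(v > 0 for v in gs[1:4])
        f3 = gs[4] == 0 and all(v > 0 for v in gs[5:8])
        if f2 and f3 and not pi(gs[1], gs[2], gs[3], gs[5], gs[6], gs[7]):
            phi6 = False
    return phi1 and phi2 and phi3 and phi4 and phi5 and phi6


def toy_pi(c1, c2, p, c1n, c2n, pn) -> bool:
    """Hypothetical machine: at line 1, move one unit from c1 to c2 while
    c1 > 1, then jump to the halting line 2."""
    if p != 1:
        return False
    if c1 > 1:
        return (c1n, c2n, pn) == (c1 - 1, c2 + 1, 1)
    return (c1n, c2n, pn) == (c1, c2, 2)


good_trace = [(3, 1, 1), (2, 2, 1), (1, 3, 1), (1, 3, 2)]
bad_trace = [(3, 1, 1), (1, 3, 2)]  # skips two steps of the machine
good_ok = check_encoding(*encode_trace(good_trace), toy_pi, 2)
bad_ok = check_encoding(*encode_trace(bad_trace), toy_pi, 2)
```

The halting trace yields a structure with 16 addresses that satisfies all six constraints, while the trace that skips steps violates the transition constraint.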


Remark 2. Note that in the above formalization the only use of associated sums
comes from expressing the size of the set of Address elements. As for our
uninterpreted function $c(\cdot)$ we have $\forall x.\ c(x) \sim 1$, its sum $s_c$ is thus the number of
addresses. Hence, we can encode the halting problem for two-counter machines
in an almost identical way to the encoding presented here, using a generalization
of PA with two uninterpreted functions for $l(\cdot)$ and $g(\cdot)$, and a size operation
replacing $c(\cdot)$ and its associated sum.

5 SL Encodings of Smart Transitions


The definition of SL models in Sects. 3 and 4 ensured that the summation
constants $s_j$ were respectively equal to the actual summation of all balances $b_j(\cdot)$. In
this section, we address the challenge of formalizing relations between $s_j$ and $b_j(\cdot)$
in a way that the resulting encodings can be expressed in the logical frameworks
of automated reasoners, in particular of SMT solvers and first-order theorem
provers.

In what follows, we consider a single transaction or one time-step of multiple
transactions over $s_j, b_j(\cdot)$. We refer to such transitions as smart transitions.
Smart transitions are common in smart contracts, expressing for example the
minting and/or transferring of some coins, as evidenced in Fig. 1 and discussed
later.

Based on Sect. 3, our smart transitions are encoded in the $\Sigma^{l,2,d}_{+,\le}$ fragment
of SL. Note however, that neither decidability nor undecidability of this
fragment is implied by Theorem 4 or Theorem 5. In this section, we show that
our SL encoding of smart transitions is expressible in first-order logic. We first
introduce a sound, implicit SL encoding, by “hiding” away sum semantics and
using invariant relations over smart transitions (Sect. 5.1). This encoding does
not allow us to directly assert the values of any balance or sum, but we can
prove that this implicit encoding is complete, relative to a translation function
(Sect. 5.2).

By further restricting our implicit SL encoding to this relative complete
setting, we consider counting properties to explicitly reason with balances and
directly express verification conditions with unbounded sums on $s_j$ and $b_j(\cdot)$.
This is shown in Sect. 5.3, and we evaluate different variants of the explicit SL
encoding in Sect. 6, showcasing their practical use and relevance within
automated reasoning.

To directly present our SL encodings and results in the smart contract
domain, in what follows we rely on the notation of Table 1. As such, we
respectively denote $b, b'$ by old-bal, new-bal and write old-sum, new-sum for $s, s'$. As
already discussed in Fig. 1, the prefixes old- and new- refer to the entire state
expressed in the encoding before and after the smart transition. We explicitly
indicate this state using old-world, new-world respectively. The non-prefixed
versions bal and sum are stand-ins for both the old- and new- versions; Fig. 2
illustrates our setting for the smart transition of minting one coin.

With this SL notation at hand, we are thus interested in finding first-order
formulas that verify smart transition relations between old-sum and new-sum,
given the relation between old-bal and new-bal. In this paper, we mainly focus
on the smart transitions of minting and transferring money, yet our results could
be used in the context of other financial transactions/software transitions over
unbounded sums.




[Figure 2: diagram relating the old-world (old-sum : Nat, old-active : Coin → Bool, old-has-coin : Addr × Coin → Bool, old-bal : Addr → Nat) to the new-world (new-sum, new-active, new-has-coin, new-bal) via (M1)-(M4), with inv(old-has-coin, old-active) and inv(new-has-coin, new-active) holding in the respective worlds, new-sum = old-sum + 1 and new-bal(a) = old-bal(a) + 1.]

Fig. 2. Implicit SL encoding of mint_1, where Addr is short for Address.

Example 2. In the case of minting n coins in Fig. 1, we require formulas that (a)
describe the state before the transition (the old-world, thus pre-condition), (b)
formalize the transition (the relation between old-bal and new-bal; (i)-(ii) in
Fig. 1) and (c) imply the consequences for the new-world ((iii) in Fig. 1). These
formulas verify that minting and depositing n coins into some address result in
an increase of the sum by n, that is new-sum = old-sum + n, as expressed in the
functional correctness formula (1) of Fig. 1.
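As a concrete, finite illustration of the property in Example 2, the following Python sketch models balances as a dictionary and checks that minting n coins into one address increases the sum by n. The function name `mint` and the sample addresses are assumptions made for illustration only; the encodings in this paper are first-order formulas, not programs.

```python
def mint(old_bal, receiver, n):
    """Return the new balances after minting n coins into receiver."""
    new_bal = dict(old_bal)                      # all other balances stay unchanged
    new_bal[receiver] = new_bal.get(receiver, 0) + n
    return new_bal


old_bal = {"alice": 3, "bob": 5}                 # hypothetical old-world balances
n = 4
new_bal = mint(old_bal, "alice", n)

old_sum = sum(old_bal.values())
new_sum = sum(new_bal.values())
assert new_sum == old_sum + n                    # functional correctness, formula (1)
```

The check only covers one finite state; the point of the SL encodings is to establish the same relation for all states at once.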


5.1 SL Encoding Using Implicit Balances and Sums 


The first encoding we present is a set of first-order formulas with equality over 
sorts {Coin, Address}. No additional theories are considered. The Coin sort 
represents money, where one coin is one unit of money. The Address sort rep- 
resents the account addresses as before. As a consequence, balance functions 
and sum constants only exist implicitly in this encoding. As such, the property 
$\text{sum} = \sum_{a \in \text{Address}} \text{bal}(a)$ cannot be directly expressed in this encoding. Instead,
we formalize this property by using so-called smart invariant relations between
two predicates has-coin and active over coins $c \in \text{Coin}$ and addresses $a \in \text{Address}$, as
follows. 


Definition 3 (Smart Invariants). Let $\text{has-coin} \subseteq \text{Address} \times \text{Coin}$ and consider
$\text{active} \subseteq \text{Coin}$. A smart invariant of the pair (has-coin, active) is the
conjunction of the following three formulas.

1. Only active coins c can be owned by an address a:

$$\forall c : \text{Coin}.\ (\exists a : \text{Address}.\ \text{has-coin}(a,c)) \to \text{active}(c) . \tag{I1}$$

2. Every active coin c belongs to some address a:

$$\forall c : \text{Coin}.\ \text{active}(c) \to \exists a : \text{Address}.\ \text{has-coin}(a,c) . \tag{I2}$$

3. Every coin c belongs to at most one address a:

$$\forall c : \text{Coin}.\ \forall a, a' : \text{Address}.\ (\text{has-coin}(a,c) \land \text{has-coin}(a',c) \to a \approx a') . \tag{I3}$$

We write inv(has-coin, active) to denote the smart invariant (I1) ∧ (I2) ∧ (I3)
of (has-coin, active).


Intuitively, our smart invariants ensure that a coin c is active iff it is owned 
by precisely one address a. Our smart invariants imply the soundness of our 
implicit SL encoding, as follows. 


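Theorem 6 (Soundness of SL Encoding). Given that $\text{sum} = |\text{active}|$ and
for every $a \in \text{Address}$ it holds that $\text{bal}(a) = |\{c \in \text{Coin} \mid (a,c) \in \text{has-coin}\}|$, then
$\text{inv}(\text{has-coin}, \text{active}) \implies \text{sum} = \sum_{a \in \text{Address}} \text{bal}(a)$.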
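The smart invariants and the statement of Theorem 6 can be checked on a small finite model. The sketch below is only an illustration under assumed names (`inv`, `balances`, and the sample coins and addresses are hypothetical); it represents has-coin as a set of (address, coin) pairs and active as a set of coins.

```python
def inv(has_coin, active, addresses):
    """Check (I1), (I2) and (I3) on a finite model."""
    i1 = all(c in active for (_, c) in has_coin)
    i2 = all(any((a, c) in has_coin for a in addresses) for c in active)
    i3 = all(a1 == a2
             for (a1, c1) in has_coin
             for (a2, c2) in has_coin
             if c1 == c2)
    return i1 and i2 and i3


def balances(has_coin, addresses):
    """bal(a) = number of coins owned by a."""
    return {a: sum(1 for (x, _) in has_coin if x == a) for a in addresses}


# Hypothetical finite model with two active coins, each owned exactly once.
addresses = {"a1", "a2"}
active = {"c1", "c2"}
has_coin = {("a1", "c1"), ("a2", "c2")}

assert inv(has_coin, active, addresses)
bal = balances(has_coin, addresses)
total = len(active)                     # sum = |active|
assert total == sum(bal.values())       # conclusion of Theorem 6
```

Adding a second owner for c1, for example `("a2", "c1")`, makes (I3) fail, so `inv` returns False for that model.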


We say that a smart transition preserves smart invariants, when

$$\text{inv}(\text{old-has-coin}, \text{old-active}) \iff \text{inv}(\text{new-has-coin}, \text{new-active}),$$

where old-has-coin, old-active and new-has-coin, new-active respectively
denote the predicates has-coin, active in the states before and after the
smart transition. Based on the soundness of our implicit SL encoding, we
formalize smart transitions preserving smart invariants as first-order formulas. We
only discuss smart transitions implementing minting n coins here, but other
transitions, such as transferring coins, can be handled in a similar manner. We
first focus on minting a single coin, as follows.


Definition 4 (Transition mint_1(a,c)). Let $c \in \text{Coin}$, $a \in \text{Address}$.
The transition mint_1(a,c) activates coin c and deposits it into address a.

1. The coin c was inactive before and is active now:

$$\neg\,\text{old-active}(c) \land \text{new-active}(c) . \tag{M1}$$

2. The address a owns the new coin c:

$$\text{new-has-coin}(a,c) \land \forall a' : \text{Address}.\ \neg\,\text{old-has-coin}(a',c) . \tag{M2}$$

3. Everything else stays the same:

$$\forall c' : \text{Coin}.\ c' \not\approx c \to (\text{new-active}(c') \leftrightarrow \text{old-active}(c')) , \tag{M3}$$

$$\forall c' : \text{Coin}.\ \forall a' : \text{Address}.\ (c' \not\approx c \lor a' \not\approx a) \to (\text{new-has-coin}(a',c') \leftrightarrow \text{old-has-coin}(a',c')) . \tag{M4}$$

The transition mint_1(a,c) is defined as (M1) ∧ (M2) ∧ (M3) ∧ (M4).

By minting one coin, the balance of precisely one address, that is of
the receiver’s address, increases by one, whereas all other balances remain
unchanged. Thus, the sum of account balances is also expected to
increase by one, as illustrated in Fig. 2. The following theorem proves that the
definition of mint_1 is sound. That is, mint_1 affects the implicit balances and
sums as expected and hence mint_1 preserves smart invariants.


Theorem 7 (Soundness of mint_1(a,c)). Let $c \in \text{Coin}$, $a \in \text{Address}$ such
that mint_1(a,c) holds. Consider balance functions $\text{old-bal}, \text{new-bal} : \text{Address} \to \mathbb{N}$,
non-negative integer constants old-sum, new-sum, unary predicates
$\text{old-active}, \text{new-active} \subseteq \text{Coin}$ and binary predicates
$\text{old-has-coin}, \text{new-has-coin} \subseteq \text{Address} \times \text{Coin}$ such that

$$|\text{old-active}| = \text{old-sum}, \qquad |\text{new-active}| = \text{new-sum},$$

and for every address $a'$, we have

$$\text{old-bal}(a') = |\{c' \in \text{Coin} \mid (a',c') \in \text{old-has-coin}\}| ,$$
$$\text{new-bal}(a') = |\{c' \in \text{Coin} \mid (a',c') \in \text{new-has-coin}\}| .$$

Then, new-sum = old-sum + 1 and new-bal(a) = old-bal(a) + 1. Moreover, for
all other addresses $a' \neq a$, it holds that new-bal(a') = old-bal(a').
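To make Definition 4 and Theorem 7 concrete, the sketch below evaluates (M1)-(M4) on a pair of finite worlds and then checks the three conclusions of the theorem. The function names `mint1` and `bal` and the sample data are assumptions for illustration; this is a finite-model check, not the first-order encoding itself.

```python
def mint1(a, c, old_active, new_active, old_has, new_has, coins, addresses):
    """Return True iff (M1)-(M4) hold for the given old and new worlds."""
    m1 = c not in old_active and c in new_active
    m2 = (a, c) in new_has and all((a2, c) not in old_has for a2 in addresses)
    m3 = all((c2 in new_active) == (c2 in old_active)
             for c2 in coins if c2 != c)
    m4 = all(((a2, c2) in new_has) == ((a2, c2) in old_has)
             for c2 in coins for a2 in addresses
             if c2 != c or a2 != a)
    return m1 and m2 and m3 and m4


def bal(has_coin, a):
    """Number of coins owned by address a."""
    return sum(1 for (x, _) in has_coin if x == a)


# Hypothetical worlds: coin c2 is minted into address a2.
coins = {"c1", "c2"}
addresses = {"a1", "a2"}
old_active, old_has = {"c1"}, {("a1", "c1")}
new_active, new_has = {"c1", "c2"}, {("a1", "c1"), ("a2", "c2")}

holds = mint1("a2", "c2", old_active, new_active, old_has, new_has,
              coins, addresses)
assert holds
assert len(new_active) == len(old_active) + 1          # new-sum = old-sum + 1
assert bal(new_has, "a2") == bal(old_has, "a2") + 1    # receiver gains one coin
assert bal(new_has, "a1") == bal(old_has, "a1")        # other balances unchanged
```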


Smart transitions minting an arbitrary number n of coins, as in Fig. 1, are
then realized by repeating the mint_1 transition n times. Based on the soundness
of mint_1, which ensures that mint_1 preserves smart invariants, we conclude by
induction that n repetitions of mint_1, that is minting n coins, also preserve smart
invariants. The precise definition of mint_n together with the soundness result
is stated in [10].
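The repetition argument can be sketched operationally. The code below is an assumed, simplified reading in which mint_n is literally n successive applications of mint_1 to fresh coins; the precise definition is the one in [10], and the helper names `apply_mint1` and `mint_n` are hypothetical.

```python
def apply_mint1(a, c, active, has_coin):
    """Apply mint_1(a, c) to a state and return the new state."""
    if c in active or any(c2 == c for (_, c2) in has_coin):
        raise ValueError("coin must be inactive and unowned before minting")
    return active | {c}, has_coin | {(a, c)}


def mint_n(a, fresh_coins, active, has_coin):
    """Mint len(fresh_coins) coins into a by repeating mint_1."""
    for c in fresh_coins:
        active, has_coin = apply_mint1(a, c, active, has_coin)
    return active, has_coin


old_active = {"c1"}
old_has = {("a1", "c1")}
new_active, new_has = mint_n("a2", ["c2", "c3", "c4"], old_active, old_has)

n = 3
assert len(new_active) == len(old_active) + n           # new-sum = old-sum + n
owned_by_a2 = sum(1 for (a, _) in new_has if a == "a2")
assert owned_by_a2 == n                                 # a2 started with no coins
```

Each loop iteration corresponds to one induction step: if the invariants hold before the step, they hold after it.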


5.2 Completeness Relative to a Translation Function 


Smart invariants provide sufficient conditions for ensuring soundness of our SL 
encodings (Theorem 6). We next show that, under additional constraints, smart 
invariants are also necessary conditions, establishing thus (relative) completeness 
of our encodings. 

A straightforward extension of Theorem 6 however does not hold. Namely,
under the assumptions of Theorem 6 alone, the following formula is not valid:

$$\text{sum} = \sum_{a \in \text{Address}} \text{bal}(a) \iff \text{inv}(\text{has-coin}, \text{active}).$$


As a counterexample, assume (i) $\text{sum} = |\text{active}|$, (ii) for every $a \in \text{Address}$
it holds that $\text{bal}(a) = |\{c \in \text{Coin} \mid (a,c) \in \text{has-coin}\}|$, that is the assumptions
of Theorem 6. Further, let (iii) the smart invariants inv(has-coin, active) hold
for all but the coins $c_1, c_2 \in \text{Coin}$ and all but the addresses $a_1, a_2 \in \text{Address}$.
We also assume that (iv) $c_1$ is active but not owned by any address and (v) $c_2$
is active and owned by the two distinct addresses $a_1, a_2$. We thus have
$\text{sum} = \sum_{a \in \text{Address}} \text{bal}(a)$, yet inv(has-coin, active) does not hold.
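The counterexample can be replayed on the smallest model that exhibits it. In this sketch (variable names are illustrative), coin c1 is active but unowned and coin c2 is owned by both addresses, so the missing coin and the double-counted coin cancel out in the sum.

```python
addresses = {"a1", "a2"}
active = {"c1", "c2"}
has_coin = {("a1", "c2"), ("a2", "c2")}    # c1 has no owner, c2 has two owners

total = len(active)                        # sum = |active|
bal = {a: sum(1 for (x, _) in has_coin if x == a) for a in addresses}

i1 = all(c in active for (_, c) in has_coin)
i2 = all(any((a, c) in has_coin for a in addresses) for c in active)
i3 = all(a1 == a2
         for (a1, c1) in has_coin
         for (a2, c2) in has_coin
         if c1 == c2)

sums_agree = total == sum(bal.values())    # both sides count two coins
inv_holds = i1 and i2 and i3               # (I2) and (I3) are violated
```

Here `sums_agree` is true while `inv_holds` is false, which is exactly the situation the translation function f is designed to exclude.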

To ensure completeness of our encodings, we therefore introduce a translation
function f that restricts the set $F \triangleq 2^{\text{Address} \times \text{Coin}} \times 2^{\text{Coin}}$ of (has-coin, active)
pairs, as follows. We exclude from F those pairs (has-coin, active) that
violate smart invariants by both (i) not satisfying (I2), as (I2) ensures that
there are not too many active coins, and by (ii) not satisfying at least one
of (I1) and (I3), as (I1) and (I3) ensure that there are not too few active
coins. The required translation function f (as in [10]) now assigns every pair
(bal, sum) the set of all $(\text{has-coin}, \text{active}) \in F$ that satisfy $\text{sum} = |\text{active}|$,
$\text{bal}(a) = |\{c \in \text{Coin} \mid \text{has-coin}(a,c)\}|$ for every address a and have not been
excluded.


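Theorem 8 (Relative Completeness of SL Encoding). Let
$(\text{bal}, \text{sum}) \in \mathbb{N}^{\text{Address}} \times \mathbb{N}$ and let $(\text{has-coin}, \text{active}) \in f(\text{bal}, \text{sum})$ be arbitrary. Then,

$$\text{sum} = \sum_{a \in \text{Address}} \text{bal}(a) \iff \text{inv}(\text{has-coin}, \text{active}).$$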


5.3 SL Encodings Using Explicit Balances and Sums 


We now restrict our SL encoding from Sect. 5.1 to explicitly reason with balance 
functions during smart transitions. We do so by expressing our translation func- 
tion f from Sect. 5.2 in first-order logic. We now use the summation constant 
$\text{sum} \in \mathbb{N}$ and the balance function $\text{bal} : \text{Address} \to \mathbb{N}$ in our SL encoding. In
particular, we use our smart invariants inv(has-coin, active) in this explicit
SL encoding together with two additional axioms (Ax1, Ax2), ensuring that
$\text{sum} = |\text{active}|$ and $\text{bal}(a) = |\{c \in \text{Coin} \mid \text{has-coin}(a,c)\}|$ for all $a \in \text{Address}$.

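To formalize the additional properties, we introduce two counting
mechanisms in our SL encoding. The first one is a bijective function $\text{count} : \text{Coin} \to \mathbb{N}^+$
and the second one is a function $\text{idx} : \text{Address} \times \text{Coin} \to \mathbb{N}^+$, where
$\text{idx}(a, \cdot) : \text{Coin} \to \mathbb{N}^+$ is bijective for every $a \in \text{Address}$. To ensure that count and $\text{idx}(a, \cdot)$
count coins, we impose the following two properties:

$$\forall c : \text{Coin}.\ \text{active}(c) \leftrightarrow \text{count}(c) \le \text{sum}, \tag{Ax1}$$

$$\forall c : \text{Coin}.\ \forall a : \text{Address}.\ \text{has-coin}(a,c) \leftrightarrow \text{idx}(a,c) \le \text{bal}(a) . \tag{Ax2}$$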


Figure 3 illustrates our revised SL encoding for our smart transition mint_1.
We next ensure soundness of our resulting explicit encoding for summation, as 
follows. 


Theorem 9 (Soundness of Explicit SL Encodings). Let there be a pair
$(\text{bal}, \text{sum}) \in \mathbb{N}^{\text{Address}} \times \mathbb{N}$, a pair $(\text{has-coin}, \text{active}) \in F$, and functions
$\text{count} : \text{Coin} \to \mathbb{N}^+$ and $\text{idx} : \text{Address} \times \text{Coin} \to \mathbb{N}^+$.
Given that count is bijective, $\text{idx}(a, \cdot) : \text{Coin} \to \mathbb{N}^+$ is bijective for every
$a \in \text{Address}$, and that (Ax1), (Ax2) and inv(has-coin, active) hold, then
$\text{sum} = |\text{active}|$ and $\text{bal}(a) = |\{c \in \text{Coin} : \text{has-coin}(a,c)\}|$ for every $a \in \text{Address}$.

In particular, we have $\text{sum} = \sum_{a \in \text{Address}} \text{bal}(a)$.
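A finite approximation of the explicit encoding can illustrate how (Ax1) and (Ax2) tie the counting functions to sum and bal. The sketch below assumes a finite coin set, so count and each idx(a, ·) are bijections onto 1..|Coin| rather than onto the positive naturals; all concrete values are hypothetical.

```python
coins = ["c1", "c2", "c3", "c4"]
addresses = ["a1", "a2"]

# Assumed numberings: each is a bijection from the coin set onto 1..4.
count = {"c1": 1, "c2": 2, "c3": 3, "c4": 4}
idx = {
    "a1": {"c1": 1, "c3": 2, "c2": 3, "c4": 4},
    "a2": {"c2": 1, "c1": 2, "c3": 3, "c4": 4},
}
total = 3                        # the constant sum
bal = {"a1": 2, "a2": 1}

# (Ax1) and (Ax2), read as definitions of active and has-coin.
active = {c for c in coins if count[c] <= total}
has_coin = {(a, c) for a in addresses for c in coins if idx[a][c] <= bal[a]}

# The smart invariants (I1)-(I3) hold in this model.
i1 = all(c in active for (_, c) in has_coin)
i2 = all(any((a, c) in has_coin for a in addresses) for c in active)
i3 = all(a1 == a2
         for (a1, c1) in has_coin
         for (a2, c2) in has_coin
         if c1 == c2)
assert i1 and i2 and i3

# Conclusions of Theorem 9 on this model.
assert total == len(active)
assert all(bal[a] == sum(1 for (x, _) in has_coin if x == a) for a in addresses)
assert total == sum(bal.values())
```

The first two conclusions follow from the axioms and bijectivity alone; the invariants are what make the two counts agree in the final equality.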


When compared to Sect. 5.1, our explicit SL encoding introduced above uses 
our smart invariants as axioms of our encoding, together with (Ax1) and (Ax2). 
In our explicit SL encoding, the post-conditions asserting functional
correctness of smart transitions thus express relations between old-sum and new-sum. For
example, for mint_n, we are interested in ensuring

$$\text{mint}_n \to \text{new-sum} = \text{old-sum} + n . \tag{2}$$

By using two new constants $\text{old-total}, \text{new-total} \in \mathbb{N}$, we can use sum =
total as smart invariant for mint_n. As a result, the property to be ensured is
then

$$(\text{old-sum} = \text{old-total} \land \text{new-total} = \text{old-total} + n \land \text{mint}_n) \implies (\text{new-sum} = \text{new-total}) . \tag{3}$$

It is easy to see that the negations of (2) and (3) are equisatisfiable. We note 
however that the additional constants old-total, new-total used in (3) lead 
to unstable results within automated reasoners, as discussed in Sect. 6. 


[Figure 3: diagram relating the old-world (old-sum : Nat, old-active : Coin → Bool, old-has-coin : Addr × Coin → Bool, old-bal : Addr → Nat) to the new-world (new-sum, new-active, new-has-coin, new-bal), with inv(old-has-coin, old-active) and inv(new-has-coin, new-active) holding in the respective worlds, new-sum = old-sum + 1 and new-bal(a) = old-bal(a) + 1.]

Fig. 3. Explicit SL encoding of mint_1, where Addr is short for Address.

6 Experiments


From Theory to Practice. To make our explicit SL encodings handier for 
automated reasoners, we improved the setting illustrated in Fig. 3 by applying
the following restrictions without losing any generality. 


(i) The predicates has-coin and active were removed from the explicit SL 
encodings, by replacing them by their equivalent expressions (Ax1)-(Ax2). 

(ii) The surjectivity assertions of count and idx were restricted to the relevant 
intervals [1, sum], [1,bal(a)] respectively. 

(iii) Compared to Fig. 3, only one shared count function and one shared idx function
were used. This does not reduce the expressivity of our resulting
SL encoding, as shown in [10].

(iv) When our SL encoding contains expressions such as
$\forall c : \text{Coin}.\ \text{idx}(a_0, c) \in [l_0, u_0] \leftrightarrow \text{idx}(a_1, c) \in [l_1, u_1]$, with $a_0, a_1$ being distinct addresses such that
either $u_i \le \text{bal}(a_i)$ or $l_i > \text{bal}(a_i)$, $i \in \{0,1\}$, then it can be assumed that the
coins in those intervals are in the same order for both functions [10].


Based on the above, we derive three different explicit SL encodings to be 
used in automated reasoning about smart transitions. We respectively denote 
these explicit SL encodings by int, nat and id, and describe them next. 


Benchmarks. In our experiments, we consider four smart transitions mint_1,
mint_n, transferFrom_1, and transferFrom_n, respectively denoting minting and
transferring one and n coins. These transitions capture the main operations of
linear integer arithmetic. In particular, mint_n implements the smart transition
of our running example from Fig. 1.

For each of the four smart transitions, we implement four SL encodings: the 
implicit SL encoding uf from Sect. 5.1 using only uninterpreted functions and
three explicit encodings int, nat and id as variants of Sect. 5.3. We also
consider three additional arithmetic benchmarks using int, which are not directly
motivated by smart contracts. Together with variants of int and nat presented 
in the sequel, our benchmark set contains 31 examples altogether, with each 
example being formalized in the SMT-LIB input syntax [1]. In addition to our 
encodings, we also proved consistency of the axioms used in our encodings. 


[Figure 4: two linked lists of coins, labeled #1 and #2, whose elements are connected by next pointers.]

Fig. 4. Linked lists in id.


SL Encodings and Relaxations. Our explicit SL encoding int uses linear
integer arithmetic, whereas nat and id are based on natural numbers. As
naturals are not a built-in theory in SMT-LIB, we assert the axioms of Presburger
arithmetic directly in the encodings of nat and id.
In our id encodings, inductive datatypes are additionally used to order coins.
There exists one linked list of all coins for count and one for each $\text{idx}(a, \cdot)$,
$a \in \text{Address}$. Additionally, there exists a “null” coin, which is the first
element of every list and is not owned by any address. As shown in Fig. 4, the
numbering of each coin is defined by its position in the respective list. This
way surjectivity for count and idx can respectively be asserted by the
formulas $\exists c : \text{Coin}.\ \text{count}(c) \approx \text{sum}$ and $\forall a : \text{Address}.\ \exists c : \text{Coin}.\ \text{idx}(a,c) \approx \text{bal}(a)$.
However, asserting surjectivity for int and nat cannot be achieved without
quantifying over $\mathbb{N}^+$. Such quantification would drastically affect the performance of
automated reasoners in (fragments of) first-order logics. As a remedy, within
the default encodings of int and nat, we only consider relevant instances of
surjectivity.
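The list-based numbering used in id can be pictured with a short sketch. It assumes that the “null” coin sits at position 0 and that every other coin is numbered by its position in the list; the name `NULL`, the helper `numbering`, and the sample lists are hypothetical and only illustrate why a single witness suffices for surjectivity.

```python
NULL = "null"  # hypothetical name for the "null" coin heading every list


def numbering(order):
    """Map each coin to its position in the list NULL -> order[0] -> ..."""
    return {coin: pos for pos, coin in enumerate([NULL] + list(order))}


count = numbering(["c1", "c2", "c3"])          # one list for count
idx_a = numbering(["c2", "c1", "c3"])          # one list for idx(a, .)
total, bal_a = 2, 1

# Positions in a list have no gaps, so one witness per bound is enough
# to cover the whole interval [1, total] or [1, bal(a)].
assert any(k == total for k in count.values())
assert any(k == bal_a for k in idx_a.values())

active = {c for c, k in count.items() if 1 <= k <= total}
owned_by_a = {c for c, k in idx_a.items() if 1 <= k <= bal_a}
```

With integers or naturals as the codomain there is no such list structure, which is why the int and nat encodings fall back to selected instances of surjectivity.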

Further, we consider variations of int and nat by asserting proper
surjectivity onto the relevant intervals of idx and count (denoted as surj) and/or adding
the total constants mentioned in Sect. 5.3 (denoted as with total, no total).
These variations of int and nat are implemented for mint_1 and transferFrom_1.


Experimental Setting. We evaluated our benchmark set of 31 examples using 
SMT solvers Z3 [7] and CVC4 [6], as well as the first-order theorem prover 
Vampire [19]. Our experiments were run on a standard machine with an Intel 
Core i5-6200U CPU (2.30 GHz, 2.40 GHz) and 8 GB RAM. The time is given
in seconds and we ran all experiments with a time limit of 300 s. A timeout is
indicated by the symbol ×. The default parameters were used for each solver,
unless stated otherwise in the corresponding tables (see footnote 4).


Experimental Analysis. We first report on our experiments using different
variations of int and nat. Table 3 shows that asserting complete surjectivity
for int and nat is computationally hard and indeed significantly affects the
performance of automated reasoners. Thus, for the following experiments only
relevant instances of surjectivity, such as $\exists c : \text{Coin}.\ \text{count}(c) \approx \text{sum}$, were asserted
in int and nat. Table 3 also illustrates the instability of using the total constants.
Some tasks seem to become easier even though adding the extra constants
strictly increases their reasoning difficulty.

Table 3. Results of mint_1 and transferFrom_1 using nat and int, with/without the
total constants and with/without surjectivity.

                        mint_1                       transferFrom_1
Encoding         Z3      CVC4    Vampire       Z3       CVC4    Vampire
no total
  nat            0.02    ×       0.92          ×        ×       15.35
  nat surj.      ×       ×       ×             100.03   ×       ×
  int            0.02    0.03    ×             0.02     0.07    ×
  int surj.      ×       5.96    ×             1.02     ×       ×
with total
  nat            0.03    ×       2.92          0.28     ×       22.54
  nat surj.      0.11    ×       ×             38.24    ×       ×
  int            0.02    0.03    ×             0.02     0.10    ×
  int surj.      3.81    5.95    ×             ×        6.56    ×

4 The precise calls and encodings are available at github.com/SoRaTu/SmartSums.

Table 4. Smart transitions using implicit/explicit SL encodings.

Encoding   Solver     mint_1    transferFrom_1    mint_n     transferFrom_n
uf         Z3         0.01      0.02              ×          ×
           CVC4       0.02      0.03              ×          ×
           Vampire    0.18      0.19              0.35*      0.44*
nat        Z3         0.02      ×                 ×          ×
           CVC4       ×         ×                 ×          ×
           Vampire    0.92      15.35             23.23†     228.22†
int        Z3         0.02      0.02              0.03       0.11
           CVC4       0.03      0.07              0.05       0.35
           Vampire    ×         ×                 ×          ×
id         Z3         ×         ×                 ×          ×
           CVC4       ×         ×                 ×          ×
           Vampire    7.36‡     17.16‡            23.52‡     ×

Our most important experimental findings are shown in Table 4, demonstrat- 
ing that our SL encodings are suitable for automated reasoners. Thanks to our 
explicit SL encodings, each solver can certify every smart transition in at least 
one encoding. Our explicit SL encodings are more relevant than the implicit 
encoding uf as we can express and compare any two non-negative integer sums,
whereas for uf handling arbitrary values n can only be done by iterating over the
mint_1 (or transferFrom_1) transition. This iteration requires inductive
reasoning, which currently only Vampire could do [15], as indicated by the superscript
*. Nevertheless, the transactions mint_1 and transferFrom_1, which involve only one
coin in uf, require no inductive reasoning as the actual sum is not considered;
each of our solvers can certify these examples.

We note that the tasks mint_n and transferFrom_n from Table 4 yield a huge
search space when using their explicit SL encodings within automated reasoners.
We split these tasks into proving intermediate lemmas and proved each of these
lemmas independently, by the respective solver. In particular, we used one lemma
for mint_n and four lemmas for transferFrom_n. In our experiments, we only
used the recent theory reasoning framework of Vampire with split queues [13]
and indicate our results by superscript †.

We further remark that our explicit SL encoding id using inductive datatypes
also requires inductive reasoning about smart transitions and beyond. The need
for induction explains why SMT solvers failed to prove our id benchmarks, as
shown in Table 4. We note that Vampire found a proof using built-in
induction [15] and theory-specific reasoning [13], as indicated by superscript ‡.

We conclude by showing the generality of our approach beyond smart
transitions. It in fact enables fully automated reasoning about any two summations
$\sum_{i \in I} g(i)$, $\sum_{i \in I} h(i)$ of non-negative integer values $g(i)$, $h(i)$ ($i \in I$) over a
common finite set $I$. The examples of Table 5 affirm this claim.

Table 5. Arithmetic reasoning in the explicit SL encoding int.

Task                                                            Time
Transition                           Impact                     Z3      CVC4      Vampire
new-bal(a_0) = old-bal(a_0) + 3      new-sum = old-sum          0.20    1.28      ×
new-bal(a_1) = old-bal(a_1) - 3

new-bal(a_0) = old-bal(a_0) + 4      new-sum = old-sum + 2      0.58    7.14      ×
new-bal(a_1) = old-bal(a_1) - 2

new-bal(a_0) = old-bal(a_0) + 5      new-sum = old-sum + 1      1.52    155.20    ×
new-bal(a_1) = old-bal(a_1) - 3
new-bal(a_2) = old-bal(a_2) - 1
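The arithmetic behind the three tasks of Table 5 can be replayed on concrete numbers. The sketch below picks hypothetical starting balances (the value 10 is an assumption) and confirms that the listed local changes produce the listed impact on the sum; the solvers, by contrast, prove this for all balances.

```python
# (local balance changes, expected change of the sum), one entry per row of Table 5
rows = [
    ({"a0": +3, "a1": -3}, 0),
    ({"a0": +4, "a1": -2}, 2),
    ({"a0": +5, "a1": -3, "a2": -1}, 1),
]

old_bal = {"a0": 10, "a1": 10, "a2": 10}    # hypothetical old-world balances
old_sum = sum(old_bal.values())

impacts = []
for deltas, expected in rows:
    new_bal = {a: old_bal[a] + deltas.get(a, 0) for a in old_bal}
    assert all(v >= 0 for v in new_bal.values())   # balances stay non-negative
    impact = sum(new_bal.values()) - old_sum
    assert impact == expected
    impacts.append(impact)
```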


7 Related Work 


Smart Contract Safety. Formal verification of smart contracts is an emerging hot 
topic because of the value of the assets stored in smart contracts, e.g. the DeFi 
software [3]. Due to the nature of the blockchain, bugs in smart contracts are 
irreversible and thus the demand for provably bug-free smart contracts is high. 

The K interactive framework has been used to verify safety of a smart con- 
tract, e.g. in [23]. Isabelle [22] was also shown to be useful in manual, interactive 
verification of smart contracts [17]. We, however, focus on automated approaches. 

There are also efforts to perform deductive verification of smart contracts 
both on the source level in languages such as Solidity [4,14,33] and Move [35], 
as well as on the Ethereum virtual machine (EVM) level [2,29]. This paper
improves the effectiveness of these approaches by developing techniques for auto- 
matically reasoning about unbounded sums. This way, we believe we support a 
more semantic-based verification of smart contracts. 

Our approach differs from works using ghost variables [14], since we do not 
manually update the “ghost state”. Instead, the verifier needs only to reason 
about the local changes, and the aggregate state is maintained by the axioms. 
That means other approaches assume (a) the local changes and (b) the impact
on ghost variables (sum), whereas we only assume (a) and automatically prove
that (a) implies (b). This way, we reduce the user-guidance in providing and proving (b).

Our work complements approaches that verify smart contracts as finite state 
machines [33] and methods, like ZEUS [18], using symbolic model checking and 
abstract interpretation to verify generic safety properties for smart contracts. 

The work in [30] provides an extensive evaluation of ERC-20 and ERC-721 
tokens. ERC-721 extends ERC-20 with ownership functions, one of which is
“approve”. It enables transactions on another party’s behalf. This is independent
of our ability to express sums in first-order logic, as the transaction’s initiator is
irrelevant to its effect.


Reasoning about Financial Applications. Recently, the Imandra prover intro- 
duced an automated reasoning framework for financial applications [24-26]. Sim- 
ilarly to our approach, these works use SMT procedures to verify and/or gen- 
erate counter-examples to safety properties of low- and high-level algorithms. 
In particular, results of [24-26] include examples of verifying ranking orders in 
matching logics of exchanges, proving high-level properties such as transitivity 
and anti-symmetry of such orders. In contrast, we focus on verifying proper- 
ties relating local changes in balances to changes of the global state (the sum). 
Moreover, our encodings enable automated reasoning both in SMT solving and 
first-order theorem proving. 


Automated Aggregate Reasoning. The theory of first-order logic with aggregate 
operators has been thoroughly studied in [16,21]. Although such logics have been proven strictly
more expressive than first-order logic, both in the case of general aggregates
and in that of simple counting logics, in this paper we present a practical way to
encode a weakened version of aggregates (specifically sums) in first-order logic.
Our encoding (as in Sect. 5) works by expressing particular sums of interest,
harnessing domain knowledge to avoid the need for general aggregate operators.

Previous works [5,20] in the field of higher-order reasoning do not directly 
discuss aggregates. The work of [20] extends Presburger arithmetic with Boolean 
algebra for finite, unbounded sets of uninterpreted elements. This includes a way 
to express the set cardinalities and to compare them against integer variables, 
but does not support uninterpreted functions, such as the balance functions we 
use throughout our approach. 

The SMT-based framework of [5] takes a different, white-box approach, mod- 
ifying the inner workings of SMT solvers to support higher-order logic. We on the 
other hand treat theorem provers and SMT solvers as black-boxes, constructing 
first-order formulas that are tailored to their capabilities. This allows us to use 
any off-the-shelf SMT solver. 

In [8], an SMT module for the theory of FO(Agg) is presented, which can be 
used in all DPLL-based SAT, SMT and ASP solvers. However, FO(Agg) only 
provides a way to express functions that have sets or similar constructs as inputs, 
but not to verify their semantic behavior. 


8 Conclusions 


We present a methodology for reasoning about unbounded sums in the context 
of smart transitions, that is transitions that occur in smart contracts model- 
ing transactions. Our sum logic SL and its usage of sum constants, instead of 
fully-fledged sum operators, turns out to be most appropriate for the setting of 
smart contracts. We show that SL has decidable fragments (Sect. 4.1), as well 
as undecidable ones (Sect. 4.2). Using two phases to first implicitly encode SL 
in first-order logic (Sect. 5.1), and then explicitly encode it (Sect. 5.3), allows us
to use off-the-shelf automated reasoners in new ways, and automatically verify
the semantic correctness of smart transitions. 

Showing the (un)decidability of the SL fragment with two sets of uninter- 
preted functions and sums is an interesting step for further work, as this fragment 
supports encoding smart transition systems. Another interesting direction of 
future work is to apply our approach to different aggregates, such as minimum 
and maximum, and to reason about the conditions under which these values stay
above/below certain thresholds. A slightly modified setting of our SL axioms
can already handle min/max aggregates in a basic way, namely by using > and 
< instead of equality and dropping the injectivity/surjectivity (respectively) 
axioms of the counting mechanisms. 

Summing over multidimensional arrays in various ways is yet another
direction of future research. Our approach supports the summation over all values
in all dimensions by adding the required number of parameters to the predicate
idx and by adapting the axioms accordingly.

Abstract. Stateless model checking (SMC) is one of the standard
approaches to the verification of concurrent programs. As scheduling 
non-determinism creates exponentially large spaces of thread interleav- 
ings, SMC attempts to partition this space into equivalence classes and 
explore only a few representatives from each class. The efficiency of this 
approach depends on two factors: (a) the coarseness of the partitioning, 
and (b) the time to generate representatives in each class. For this rea- 
son, the search for coarse partitionings that are efficiently explorable is 
an active research challenge. 

In this work we present RVF-SMC, a new SMC algorithm that uses a 
novel reads-value-from (RVF) partitioning. Intuitively, two interleavings 
are deemed equivalent if they agree on the value obtained in each read 
event, and read events induce consistent causal orderings between them. 
The RVF partitioning is provably coarser than recent approaches based 
on Mazurkiewicz and “reads-from” partitionings. Our experimental eval- 
uation reveals that RVF is quite often a very effective equivalence, as the 
underlying partitioning is exponentially coarser than other approaches. 
Moreover, RVF-SMC generates representatives very efficiently, as the 
reduction in the partitioning is often met with significant speed-ups in 
the model checking task. 


1 Introduction 


The verification of concurrent programs is one of the key challenges in formal 
methods. Interprocess communication adds a new dimension of non-determinism 
in program behavior, which is resolved by a scheduler. As the programmer has 
no control over the scheduler, program correctness has to be guaranteed under 
all possible schedulers, i.e., the scheduler is adversarial to the program and can 
generate erroneous behavior if one can arise out of scheduling decisions. On the 
other hand, during program testing, the adversarial nature of the scheduler is 
to hide erroneous runs, making bugs extremely difficult to reproduce by testing 
alone (aka Heisenbugs [1]). Consequently, the verification of concurrent programs 
rests on rigorous model checking techniques [2] that cover all possible program
behaviors that can arise out of scheduling non-determinism, leading to early
tools such as VeriSoft [3,4] and CHESS [5]. 

To battle with the state-space explosion problem, effective model checking for 
concurrency is stateless. A stateless model checker (SMC) explores the behav- 
ior of the concurrent program by manipulating traces instead of states, where 
each (concurrent) trace is an interleaving of event sequences of the corresponding 
threads [6]. To further improve performance, various techniques try to reduce the 
number of explored traces, such as context-bounded techniques [7-10]. As many
interleavings induce the same program behavior, SMC partitions the
interleaving space into equivalence classes and attempts to sample a few representative
traces from each class. The most popular approach in this domain is
partial-order reduction [6,11,12], which deems interleavings equivalent
based on the way that conflicting memory accesses are ordered, also known as
the Mazurkiewicz equivalence [13]. Dynamic partial order reduction [14]
constructs this equivalence dynamically, when all memory accesses are known, and
thus does not suffer from the imprecision of earlier approaches based on static 
information. Subsequent works managed to explore the Mazurkiewicz partition- 
ing optimally [15,16], while spending only polynomial time per class. 

The performance of an SMC algorithm is generally a product of two factors: 
(a) the size of the underlying partitioning that is explored, and (b) the total time 
spent in exploring each class of the partitioning. Typically, the task of visiting 
a class requires solving a consistency-checking problem, where the algorithm 
checks whether a semantic abstraction, used to represent the class, has a con- 
sistent concrete interleaving that witnesses the class. For this reason, the search 
for effective SMC is reduced to the search of coarse partitionings for which the 
consistency problem is tractable, and has become a very active research direc- 
tion in recent years. In [17], the Mazurkiewicz partitioning was further reduced 
by ignoring the order of conflicting write events that are not observed, while 
retaining polynomial-time consistency checking. Various other works refine the 
notion of dependencies between events, yielding coarser abstractions [18-20]. 
The work of [21] used a reads-from abstraction and showed that the consistency 
problem admits a fully polynomial solution in acyclic communication topologies. 
Recently, this approach was generalized to arbitrary topologies, with an algo- 
rithm that remains polynomial for a bounded number of threads [22]. Finally, 
recent approaches define value-centric partitionings [23], as well as partitionings 
based on maximal causal models [24]. These partitionings are very coarse, as 
they attempt to distinguish only between traces which differ in the values read 
by their corresponding read events. We illustrate the benefits of value-based 
partitionings with a motivating example. 


1.1 Motivating Example 


Consider a simple concurrent program shown in Fig. 1. The program has 98
different orderings of the conflicting memory accesses, and each ordering
corresponds to a separate class of the Mazurkiewicz partitioning. Utilizing the
reads-from abstraction reduces the number of partitioning classes to 9. However, when
taking into consideration the values that the events can read and write, the
number of cases to consider can be reduced even further. In this specific example,
there is only a single behaviour the program may exhibit, in which both read
events read the only observable value.


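[Figure 1: a concurrent program with three threads (Thread_1, Thread_2, Thread_3), each starting with a write event w(·, 1); the remaining events of the listing are not recoverable from the extracted text. Next to the program, the figure lists the number of equivalence classes under each partitioning.]

Equivalence classes:
Mazurkiewicz [15]     98
reads-from [22]       9
value-centric [23]    7
this work             1

Fig. 1. Concurrent program and its underlying partitioning classes.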


The above benefits have led to recent attempts in performing SMC using 
a value-based equivalence [23,24]. However, as the realizability problem is NP- 
hard in general [25], both approaches suffer significant drawbacks. In particular, 
the work of [23] combines the value-centric approach with the Mazurkiewicz par- 
titioning, which creates a refinement with exponentially many more classes than 
potentially necessary. The example program in Fig. 1 illustrates this: although
both read events can only observe one possible value, the work of [23] further
enumerates all Mazurkiewicz orderings of all-but-one threads, resulting in 7
partitioning classes. Separately, the work of [24] relies on SMT solvers, thus spending
exponential time to solve the realizability problem. Hence, each approach suffers
an exponential blow-up a priori, which motivates the following question: is there
an efficient parameterized algorithm for the consistency problem? That is, we
are interested in an algorithm that is exponential-time in the worst case (as the
problem is NP-hard in general), but efficient when certain natural parameters
of the input are small, and thus only becomes slow in extreme cases.

Another disadvantage of these works is that each of the exploration algorithms
can end up in the same class of the partitioning many times, further
hindering performance. To see an example, consider the program in Fig. 1 again.
The work of [23] assigns values to reads one by one, and in this example, it needs
to consider as separate cases both permutations of the two reads as the orders
for assigning the values. This is to ensure completeness in cases where there are
write events causally dependent on some read events (e.g., a write event appearing
only if its thread-predecessor reads a certain value). However, no causally
dependent write events are present in this program, and our work uses a principled
approach to detect this and avoid the redundant exploration. An example
demonstrating that [24] revisits partitioning classes is somewhat more involved;
the property follows from the lack of information sharing between spawned
subroutines, which enables the approach to be massively parallelized, as already
discussed in prior works [21, 23, 26].

1.2 Our Contributions


In this work we tackle the two challenges illustrated in the motivating example 
in a principled, algorithmic way. In particular, our contributions are as follows. 


(1) We study the problem of verifying sequentially consistent executions.
The problem is known to be NP-hard [25] in general, already for 3 threads.
We show that the problem can be solved in $O(k^{d+1} \cdot n^{k+1})$ time for an
input of $n$ events, $k$ threads and $d$ variables. Thus, although the problem is
NP-hard in general, it can be solved in polynomial time when the number of
threads and the number of variables are bounded. Moreover, our bound reduces
to $O(n^{k+1})$ in the class of programs where every variable is written by only
one thread (while read by many threads). Hence, in this case the bound is
polynomial for a fixed number of threads and without any dependence on
the number of variables.

(2) We define a new equivalence between concurrent traces, called the
reads-value-from (RVF) equivalence. Intuitively, two traces are RVF-equivalent if
they agree on the value obtained in each read event, and read events induce
consistent causal orderings between them. We show that RVF induces a
coarser partitioning than the partitionings explored by recent well-studied
SMC algorithms [15, 21, 23], and thus reduces the search space of the model
checker.

(3) We develop a novel SMC algorithm called RVF-SMC, and show that it
is sound and complete for local safety properties such as assertion violations.
Moreover, RVF-SMC has complexity $k^{d} \cdot n^{O(k)} \cdot \beta$, where $\beta$ is the
size of the underlying RVF partitioning. Under the hood, RVF-SMC uses
our consistency-checking algorithm of Item 1 to visit each RVF class during
the exploration. Moreover, RVF-SMC uses a novel heuristic to significantly
reduce the number of revisits in any given RVF class, compared to the
value-based explorations of [23, 24].

(4) We implement RVF-SMC in the stateless model checker Nidhugg [27]. Our
experimental evaluation reveals that RVF is quite often a very effective
equivalence, as the underlying partitioning is exponentially coarser than
those of other approaches. Moreover, RVF-SMC generates representatives very
efficiently, as the reduction in the partitioning is often met with significant
speed-ups in the model checking task.


2 Preliminaries


General Notation. Given a natural number $i \geq 1$, we let $[i]$ be the set
$\{1, 2, \dots, i\}$. Given a map $f\colon X \to Y$, we let $dom(f) = X$ denote the domain of
$f$. We represent maps $f$ as sets of tuples $\{(x, f(x))\}_x$. Given two maps $f_1, f_2$ over
the same domain $X$, we write $f_1 = f_2$ if for every $x \in X$ we have $f_1(x) = f_2(x)$.
Given a set $X' \subseteq X$, we denote by $f|X'$ the restriction of $f$ to $X'$. A binary
relation $\sim$ on a set $X$ is an equivalence iff $\sim$ is reflexive, symmetric and transitive.

2.1 Concurrent Model


Here we describe the computational model of concurrent programs with shared 
memory under the Sequential Consistency (SC) memory model. We follow a 
standard exposition of stateless model checking, similarly to [14, 15, 21-23, 28].

Concurrent Program. We consider a concurrent program $H = \{thr_i\}_{i=1}^{k}$ of
$k$ deterministic threads. The threads communicate over a shared memory $G$ of
global variables with a finite value domain $D$. Threads execute events of the
following types.


(1) A write event $w$ writes a value $v \in D$ to a global variable $x \in G$.
(2) A read event $r$ reads the value $v \in D$ of a global variable $x \in G$.


Additionally, threads can execute local events which do not access global vari- 
ables and thus are not modeled explicitly. 

Given an event $e$, we denote by $thr(e)$ its thread and by $var(e)$ its global
variable. We denote by $\mathcal{E}$ the set of all events, and by $R$ ($W$) the set of read
(write) events. Given two events $e_1, e_2 \in \mathcal{E}$, we say that they conflict, denoted
$e_1 \bowtie e_2$, if they access the same global variable and at least one of them is a
write event.


Concurrent Program Semantics. The semantics of H are defined by means 
of a transition system over a state space of global states. A global state consists 
of (i) a memory function that maps every global variable to a value, and (ii) 
a local state for each thread, which contains the values of the local variables 
and the program counter of the thread. We consider the standard setting of 
Sequential Consistency (SC), and refer to [14] for formal details. As usual, H is 
execution-bounded, which means that the state space is finite and acyclic. 


Event Sets. Given a set of events $X \subseteq \mathcal{E}$, we write $R(X) = X \cap R$ for the set
of read events of $X$, and $W(X) = X \cap W$ for the set of write events of $X$. Given
a set of events $X \subseteq \mathcal{E}$ and a thread $thr$, we denote by $X_{thr}$ and $X_{\neq thr}$ the events
of $thr$, and the events of all other threads in $X$, respectively.


Sequences and Traces. Given a sequence of events $\tau = e_1, \dots, e_j$, we denote
by $\mathcal{E}(\tau)$ the set of events that appear in $\tau$. We further denote $R(\tau) = R(\mathcal{E}(\tau))$
and $W(\tau) = W(\mathcal{E}(\tau))$.

Given a sequence $\tau$ and two events $e_1, e_2 \in \mathcal{E}(\tau)$, we write $e_1 <_\tau e_2$ when $e_1$
appears before $e_2$ in $\tau$, and $e_1 \leq_\tau e_2$ to denote that $e_1 <_\tau e_2$ or $e_1 = e_2$. Given
a sequence $\tau$ and a set of events $A$, we denote by $\tau|A$ the projection of $\tau$ on $A$,
which is the unique subsequence of $\tau$ that contains all events of $A \cap \mathcal{E}(\tau)$, and only
those events. Given a sequence $\tau$ and a thread $thr$, let $\tau_{thr}$ be the subsequence
of $\tau$ with events of $thr$, i.e., $\tau|\mathcal{E}(\tau)_{thr}$. Given two sequences $\tau_1$ and $\tau_2$, we denote
by $\tau_1 \circ \tau_2$ the sequence that results from appending $\tau_2$ after $\tau_1$.

A (concrete, concurrent) trace is a sequence of events $\sigma$ that corresponds to
a concrete valid execution of H. We let $enabled(\sigma)$ be the set of enabled events
after $\sigma$ is executed, and call $\sigma$ maximal if $enabled(\sigma) = \emptyset$. As H is bounded,
all executions of H are finite and the length of the longest execution in H is a
parameter of the input.

Reads-From and Value Functions. Given a sequence of events $\tau$, we define
the reads-from function of $\tau$, denoted $RF_\tau\colon R(\tau) \to W(\tau)$, as follows. Given a
read event $r \in R(\tau)$, we have that $RF_\tau(r)$ is the latest write (of any thread)
conflicting with $r$ and occurring before $r$ in $\tau$, i.e., (i) $RF_\tau(r) \bowtie r$, (ii) $RF_\tau(r) <_\tau r$,
and (iii) for each $\bar{w} \in W(\tau)$ such that $\bar{w} \bowtie r$ and $\bar{w} <_\tau r$, we have $\bar{w} \leq_\tau RF_\tau(r)$.
We say that $r$ reads-from $RF_\tau(r)$ in $\tau$. For simplicity, we assume that H has an
initial salient write event on each variable.
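
The reads-from function can be computed with a single pass over the sequence. The following is a minimal Python sketch, not the authors' implementation: the `Event` tuple and the name `reads_from` are assumptions made here for illustration, and the initial write on each variable is expected to appear explicitly in the sequence.

```python
from typing import Dict, List, NamedTuple, Optional


class Event(NamedTuple):
    thread: int  # index of the executing thread (hypothetical encoding)
    idx: int     # position of the event inside its thread
    kind: str    # "w" for a write, "r" for a read
    var: str     # global variable accessed by the event


def reads_from(tau: List[Event]) -> Dict[Event, Optional[Event]]:
    """Map every read of tau to the latest earlier write on the same variable."""
    last_write: Dict[str, Event] = {}
    rf: Dict[Event, Optional[Event]] = {}
    for e in tau:
        if e.kind == "w":
            last_write[e.var] = e
        else:
            # None signals that no write precedes the read in tau.
            rf[e] = last_write.get(e.var)
    return rf


w1 = Event(1, 0, "w", "x")
w2 = Event(2, 0, "w", "x")
r3 = Event(3, 0, "r", "x")
rf = reads_from([w1, w2, r3])  # r3 observes w2, the latest write on x
```

Swapping the two writes in the input changes the result for `r3` from `w2` to `w1`, which is exactly the distinction that a reads-from partitioning tracks.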

Further, given a trace $\sigma$, we define the value function of $\sigma$, denoted
$val_\sigma\colon \mathcal{E}(\sigma) \to D$, such that $val_\sigma(e)$ is the value of the global variable $var(e)$
after the prefix of $\sigma$ up to and including $e$ has been executed. Intuitively, $val_\sigma(e)$
captures the value that a read (resp. write) event $e$ shall read (resp. write) in $\sigma$.
The value function $val_\sigma$ is well-defined as $\sigma$ is a valid trace and the threads of
H are deterministic.


2.2 Partial Orders 


In this section we present relevant notation around partial orders, which are a 
central object in this work. 


Partial Orders. Given a set of events $X \subseteq \mathcal{E}$, a (strict) partial order $P$ over $X$ is
an irreflexive, antisymmetric and transitive relation over $X$ (i.e., $<_P \subseteq X \times X$).
Given two events $e_1, e_2 \in X$, we write $e_1 \leq_P e_2$ to denote that $e_1 <_P e_2$ or
$e_1 = e_2$. Two distinct events $e_1, e_2 \in X$ are unordered by $P$, denoted $e_1 \parallel_P e_2$, if
neither $e_1 <_P e_2$ nor $e_2 <_P e_1$, and ordered (denoted $e_1 \nparallel_P e_2$) otherwise. Given
a set $Y \subseteq X$, we denote by $P|Y$ the projection of $P$ on the set $Y$, where for
every pair of events $e_1, e_2 \in Y$, we have that $e_1 <_{P|Y} e_2$ iff $e_1 <_P e_2$. Given two
partial orders $P$ and $Q$ over a common set $X$, we say that $Q$ refines $P$, denoted
by $Q \sqsubseteq P$, if for every pair of events $e_1, e_2 \in X$, if $e_1 <_P e_2$ then $e_1 <_Q e_2$. A
linearization of $P$ is a total order that refines $P$.


Lower Sets. Given a pair $(X, P)$, where $X$ is a set of events and $P$ is a partial
order over $X$, a lower set of $(X, P)$ is a set $Y \subseteq X$ such that for every event
$e_1 \in Y$ and event $e_2 \in X$ with $e_2 \leq_P e_1$, we have $e_2 \in Y$.
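
As a small illustration of the lower-set condition, the sketch below checks it directly. The representation is an assumption made here: events are arbitrary hashable values, and the partial order is given as a transitively closed set `less` of pairs `(a, b)` meaning a is ordered before b.

```python
from typing import Hashable, Set, Tuple


def is_lower_set(Y: Set[Hashable], X: Set[Hashable],
                 less: Set[Tuple[Hashable, Hashable]]) -> bool:
    """Check that Y is a subset of X that is closed under predecessors in X.

    `less` holds the pairs (a, b) with a ordered before b and is assumed
    to be transitively closed.
    """
    if not Y <= X:
        return False
    return all(a in Y for (a, b) in less if b in Y and a in X)


X = {"a", "b", "c"}
less = {("a", "b"), ("b", "c"), ("a", "c")}  # the chain a < b < c
prefix_ok = is_lower_set({"a", "b"}, X, less)
gap_ok = is_lower_set({"b"}, X, less)  # misses the predecessor "a"
```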


Visible Writes. Given a partial order $P$ over a set $X$, and a read event $r \in
R(X)$, the set of visible writes of $r$ is defined as


$$VisibleW_P(r) = \{\, w \in W(X) : \text{(i) } r \bowtie w \text{ and (ii) } r \not<_P w \text{ and (iii) for each } w' \in W(X) \text{ with } r \bowtie w', \text{ if } w <_P w' \text{ then } w' \not<_P r \,\}$$


i.e., the set of write events $w$ conflicting with $r$ that are not “hidden” to $r$ by $P$.
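
The notion of visibility can be made concrete with a short Python sketch. It reads the definition as follows, which is an interpretation adopted here: a write is visible to a read if it is on the same variable, is not ordered after the read, and no other write on that variable is ordered strictly between the two. The string-based event encoding and the transitively closed pair set `less` are illustrative assumptions.

```python
from typing import Dict, Set, Tuple


def visible_writes(r: str, writes: Set[str], var: Dict[str, str],
                   less: Set[Tuple[str, str]]) -> Set[str]:
    """Writes on var[r] that are not hidden to r by the partial order `less`.

    `less` contains the pairs (a, b) with a ordered before b and is
    assumed to be transitively closed.
    """
    same_var = {w for w in writes if var[w] == var[r]}
    visible = set()
    for w in same_var:
        if (r, w) in less:
            continue  # w is ordered after r, so r cannot observe it
        hidden = any((w, w2) in less and (w2, r) in less for w2 in same_var)
        if not hidden:
            visible.add(w)
    return visible


var = {"w1": "x", "w2": "x", "w3": "x", "r": "x"}
less = {("w1", "w2"), ("w2", "r"), ("w1", "r")}  # w1 < w2 < r, w3 unordered
vis = visible_writes("r", {"w1", "w2", "w3"}, var, less)
```

In this example `w1` is hidden by `w2`, while the unordered write `w3` remains visible.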


The Program Order PO. The program order $PO$ of H is a partial order
$<_{PO} \subseteq \mathcal{E} \times \mathcal{E}$ that defines a fixed order between some pairs of events of the same
thread, reflecting the semantics of H.

A set of events $X \subseteq \mathcal{E}$ is proper if (i) it is a lower set of $(\mathcal{E}, PO)$, and (ii) for
each thread $thr$, the events $X_{thr}$ are totally ordered in $PO$ (i.e., for each distinct
$e_1, e_2 \in X_{thr}$ we have $e_1 \nparallel_{PO} e_2$). A sequence $\tau$ is well-formed if (i) its set of events
$\mathcal{E}(\tau)$ is proper, and (ii) $\tau$ respects the program order (formally, $\tau \sqsubseteq PO|\mathcal{E}(\tau)$).
Every trace $\sigma$ of H is well-formed, as it corresponds to a concrete valid execution
of H. Each event of H is then uniquely identified by its PO predecessors, and by
the values that the reads among its PO predecessors have read.


Fig. 2. A trace $\sigma$; the displayed events $\mathcal{E}(\sigma)$ are vertically ordered as they appear in $\sigma$.
The solid black edges represent the program order PO. The dashed red edges represent
the reads-from function $RF_\sigma$. The transitive closure of all the edges then gives us the
causally-happens-before partial order $\mapsto_\sigma$.


Causally-Happens-Before Partial Orders. A trace $\sigma$ induces a
causally-happens-before partial order $\mapsto_\sigma \subseteq \mathcal{E}(\sigma) \times \mathcal{E}(\sigma)$, which is the weakest partial
order such that (i) it refines the program order (i.e., $\mapsto_\sigma \sqsubseteq PO|\mathcal{E}(\sigma)$), and (ii) for
every read event $r \in R(\sigma)$, its reads-from $RF_\sigma(r)$ is ordered before it (i.e.,
$RF_\sigma(r) \mapsto_\sigma r$). Intuitively, $\mapsto_\sigma$ contains the causal orderings in $\sigma$, i.e., it captures
the flow of write events into read events in $\sigma$ together with the program order.
Figure 2 presents an example of a trace and its causal orderings.


3 Reads-Value-From Equivalence 


In this section we present our new equivalence on traces, called the
reads-value-from equivalence (RVF equivalence, or $\sim_{RVF}$, for short). Then we illustrate that
$\sim_{RVF}$ has some desirable properties for stateless model checking.


Reads-Value-From Equivalence. Given two traces $\sigma_1$ and $\sigma_2$, we say that
they are reads-value-from-equivalent, written $\sigma_1 \sim_{RVF} \sigma_2$, if the following hold.


(1) $\mathcal{E}(\sigma_1) = \mathcal{E}(\sigma_2)$, i.e., they consist of the same set of events.
(2) $val_{\sigma_1} = val_{\sigma_2}$, i.e., each event reads resp. writes the same value in both.
(3) $\mapsto_{\sigma_1}|R = \mapsto_{\sigma_2}|R$, i.e., their causal orderings agree on the read events.
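
The three conditions can be checked mechanically on small traces. The Python sketch below is an illustration under stated assumptions, not the paper's implementation: events are encoded as a hypothetical `Event` tuple in which writes carry their value, the causal order is obtained as the transitive closure of program-order and reads-from edges, and the example traces are made up here rather than taken from the figures.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Event(NamedTuple):
    thread: int                # executing thread
    idx: int                   # position inside the thread (program order)
    kind: str                  # "w" or "r"
    var: str                   # accessed global variable
    val: Optional[int] = None  # written value; None for reads


def analyse(trace: List[Event]):
    """Return the value function and the causal order restricted to reads."""
    last_write: Dict[str, Event] = {}
    prev: Dict[int, Event] = {}
    val: Dict[Event, Optional[int]] = {}
    edges: Set[Tuple[Event, Event]] = set()
    for e in trace:
        if e.thread in prev:
            edges.add((prev[e.thread], e))  # program-order edge
        prev[e.thread] = e
        if e.kind == "w":
            last_write[e.var] = e
            val[e] = e.val
        else:
            w = last_write.get(e.var)
            val[e] = None if w is None else w.val
            if w is not None:
                edges.add((w, e))           # reads-from edge
    reach = set(edges)                      # Warshall-style transitive closure
    for m in trace:
        for a in trace:
            for b in trace:
                if (a, m) in reach and (m, b) in reach:
                    reach.add((a, b))
    on_reads = {(a, b) for a, b in reach if a.kind == "r" and b.kind == "r"}
    return val, on_reads


def rvf_equivalent(s1: List[Event], s2: List[Event]) -> bool:
    v1, c1 = analyse(s1)
    v2, c2 = analyse(s2)
    return set(s1) == set(s2) and v1 == v2 and c1 == c2


w1 = Event(1, 0, "w", "x", 1)
w2 = Event(2, 0, "w", "x", 1)
r3 = Event(3, 0, "r", "x")
same_class = rvf_equivalent([w1, w2, r3], [w2, w1, r3])
```

In the example, the read observes a different write in each trace but the same value, so the two traces fall into one RVF class even though their reads-from functions differ.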


Figure 3 presents an intuitive example of RVF-(in)equivalent traces.

Fig. 3. Three traces $\sigma_1$, $\sigma_2$, $\sigma_3$; the events of each trace are vertically ordered as they
appear in the trace. Traces $\sigma_1$ and $\sigma_2$ are RVF-equivalent ($\sigma_1 \sim_{RVF} \sigma_2$), as they have
the same events, same value function, and the two read events are causally unordered
in both. Trace $\sigma_3$ is not RVF-equivalent with either of $\sigma_1$ and $\sigma_2$. Compared to $\sigma_1$
resp. $\sigma_2$, the value function of $\sigma_3$ differs ($r(y)$ reads a different value), and the causal
orderings of the reads differ ($r(x) \mapsto_{\sigma_3} r(y)$).


Soundness. The RVF equivalence induces a partitioning on the maximal traces 
of H. Any algorithm that explores each class of this partitioning provably dis- 
covers every reachable local state of every thread, and thus RVF is a sound 
equivalence for local safety properties, such as assertion violations, in the same 
spirit as in other recent works [21-24]. This follows from the fact that for any
two traces $\sigma_1$ and $\sigma_2$ with $\mathcal{E}(\sigma_1) = \mathcal{E}(\sigma_2)$ and $val_{\sigma_1} = val_{\sigma_2}$, the local states of
each thread are equal after executing $\sigma_1$ and $\sigma_2$.


Mazurkiewicz [14, 15, 29] -> data-centric [21]
data-centric [21] -> reads-from [22, 28] -> reads-value-from
data-centric [21] -> value-centric [23] -> reads-value-from


Fig. 4. SMC trace equivalences. An edge from X to Y signifies that Y is always at least
as coarse as, and sometimes coarser than, X.


Coarseness. Here we describe the coarseness properties of the RVF equiva- 
lence, as compared to other equivalences used by state-of-the-art approaches in 
stateless model checking. Figure 4 summarizes the comparison. 

The SMC algorithms of [22] and [28] operate on a reads-from equivalence,
which deems two traces $\sigma_1$ and $\sigma_2$ equivalent if


(1) they consist of the same events ($\mathcal{E}(\sigma_1) = \mathcal{E}(\sigma_2)$), and
(2) their reads-from functions coincide ($RF_{\sigma_1} = RF_{\sigma_2}$).


The above two conditions imply that the induced causally-happens-before partial
orders are equal, i.e., $\mapsto_{\sigma_1} = \mapsto_{\sigma_2}$, and thus trivially also $\mapsto_{\sigma_1}|R = \mapsto_{\sigma_2}|R$.
Further, by a simple inductive argument the value functions of the two traces
are also equal, i.e., $val_{\sigma_1} = val_{\sigma_2}$. Hence any two reads-from-equivalent traces are
also RVF-equivalent, which makes the RVF equivalence always at least as coarse
as the reads-from equivalence.

The work of [23] utilizes a value-centric equivalence, which deems two traces 
equivalent if they satisfy all the conditions of our RVF equivalence, and also 
some further conditions (note that these conditions are necessary for correctness 
of the SMC algorithm in [23]). Thus the RVF equivalence is trivially always at 
least as coarse. The value-centric equivalence preselects a single thread thr, and 
then requires two extra conditions for the traces to be equivalent, namely: 


(1) For each read of thr, either the read reads-from a write of thr in both traces, 
or it does not read-from a write of thr in either of the two traces. 

(2) For each conflicting pair of events not belonging to thr, the ordering of the 
pair is equal in the two traces. 


Both the reads-from equivalence and the value-centric equivalence are in turn
at least as coarse as the data-centric equivalence of [21]. Given two traces, the
data-centric equivalence has the equivalence conditions of the reads-from equivalence,
and additionally, it preselects a single thread thr (just like the value-centric
equivalence) and requires the second extra condition of the value-centric equivalence,
i.e., equality of orderings for each conflicting pair of events outside of thr.

Finally, the data-centric equivalence is at least as coarse as the classical Mazurkiewicz
equivalence [13], the baseline equivalence for stateless model checking [14, 15, 29].
Mazurkiewicz equivalence deems two traces equivalent if they consist of the same
set of events and they agree on their ordering of conflicting events.
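
For comparison with the coarser equivalences, the Mazurkiewicz condition can be sketched as a direct check on two traces. The `Event` encoding and the function names below are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, NamedTuple


class Event(NamedTuple):
    thread: int
    idx: int
    kind: str  # "w" or "r"
    var: str


def conflict(a: Event, b: Event) -> bool:
    """Same variable and at least one of the two events is a write."""
    return a.var == b.var and "w" in (a.kind, b.kind)


def mazurkiewicz_equivalent(s1: List[Event], s2: List[Event]) -> bool:
    """Same events, and every conflicting pair is ordered the same way."""
    if set(s1) != set(s2):
        return False
    pos1 = {e: i for i, e in enumerate(s1)}
    pos2 = {e: i for i, e in enumerate(s2)}
    return all((pos1[a] < pos1[b]) == (pos2[a] < pos2[b])
               for a, b in combinations(s1, 2) if conflict(a, b))


w1 = Event(1, 0, "w", "x")
w2 = Event(2, 0, "w", "x")
r3 = Event(3, 0, "r", "x")
split = mazurkiewicz_equivalent([w1, w2, r3], [w2, w1, r3])
```

The two traces in the example differ only in the order of two conflicting writes, so they fall into different Mazurkiewicz classes regardless of the values written.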

While RVF is always at least as coarse, it can be (even exponentially)
coarser than each of the other above-mentioned equivalences. We illustrate
this in Appendix B of [30]. We summarize these observations in the following
proposition.


Proposition 1. RVF is at least as coarse as each of the Mazurkiewicz equiva- 
lence [15], the data-centric equivalence [21], the reads-from equivalence [22], and 
the value-centric equivalence [23]. Moreover, RVF can be exponentially coarser 
than each of these equivalences. 


In this work we develop our SMC algorithm RVF-SMC around the RVF 
equivalence, with the guarantee that the algorithm explores at most one maxi- 
mal trace per class of the RVF partitioning, and thus can perform significantly 
fewer steps than algorithms based on the above equivalences. To utilize RVF, the 
algorithm in each step solves an instance of the verification of sequential con- 
sistency problem, which we tackle in the next section. Afterwards, we present 
RVF-SMC. 

4 Verifying Sequential Consistency


In this section we present our contributions towards the problem of verifying 
sequential consistency (VSC). We present an algorithm VerifySC for VSC, and 
we show how it can be efficiently used in stateless model checking. 


The VSC Problem. Consider an input pair $(X, GoodW)$ where


(1) $X \subseteq \mathcal{E}$ is a proper set of events, and
(2) $GoodW\colon R(X) \to 2^{W(X)}$ is a good-writes function such that $w \in GoodW(r)$
only if $r \bowtie w$.


A witness of $(X, GoodW)$ is a linearization $\tau$ of $X$ (i.e., $\mathcal{E}(\tau) = X$) respecting the
program order (i.e., $\tau \sqsubseteq PO|X$), such that each read $r \in R(\tau)$ reads-from one of
its good-writes in $\tau$, formally $RF_\tau(r) \in GoodW(r)$ (we then say that $\tau$ satisfies
the good-writes function $GoodW$). The task is to decide whether $(X, GoodW)$
has a witness, and to construct one in case it exists.


VSC in Stateless Model Checking. The VSC problem naturally ties in with
our SMC approach enumerating the equivalence classes of the RVF trace
partitioning. In our approach, we shall generate instances $(X, GoodW)$ such that (i)
each witness $\sigma$ of $(X, GoodW)$ is a valid program trace, and (ii) all witnesses
$\sigma_1, \sigma_2$ of $(X, GoodW)$ are pairwise RVF-equivalent ($\sigma_1 \sim_{RVF} \sigma_2$).

Hardness of VSC. Given an input $(X, GoodW)$ to the VSC problem, let $n =
|X|$, let $k$ be the number of threads appearing in $X$, and let $d$ be the number of
variables accessed in $X$. The classic work of [25] establishes two important lower
bounds on the complexity of VSC:


(1) VSC is NP-hard even when restricted only to inputs with $k = 3$.
(2) VSC is NP-hard even when restricted only to inputs with $d = 2$.


The first bound eliminates the possibility of any algorithm with time complexity
$O(n^{f(k)})$, where $f$ is an arbitrary computable function. Similarly, the second
bound eliminates algorithms with complexity $O(n^{f(d)})$ for any computable $f$.
In this work we show that the problem is parameterizable in $k + d$, and thus
admits efficient (polynomial-time) solutions when both parameters are bounded.


4.1 Algorithm for VSC 


In this section we present our algorithm VerifySC for the problem VSC. First we
define some relevant notation. In our definitions we consider a fixed input pair
$(X, GoodW)$ to the VSC problem, and a fixed sequence $\tau$ with $\mathcal{E}(\tau) \subseteq X$.

Active Writes. A write $w \in W(\tau)$ is active in $\tau$ if it is the last write of its
variable in $\tau$. Formally, for each $w' \in W(\tau)$ with $var(w') = var(w)$ we have
$w' \leq_\tau w$. We can then say that $w$ is the active write of the variable $var(w)$ in $\tau$.

Held Variables. A variable $x \in G$ is held in $\tau$ if there exists a read $r \in
R(X) \setminus \mathcal{E}(\tau)$ with $var(r) = x$ such that for each of its good-writes $w \in GoodW(r)$ we
have $w \in \tau$. In such a case we say that $r$ holds $x$ in $\tau$. Note that several distinct
reads may hold a single variable in $\tau$.


Executable Events. An event $e \in X \setminus \mathcal{E}(\tau)$ is executable in $\tau$ if $\mathcal{E}(\tau) \cup \{e\}$
is a lower set of $(X, PO)$ and the following hold.


(1) If $e$ is a read, it has an active good-write $w \in GoodW(e)$ in $\tau$.
(2) If $e$ is a write, its variable $var(e)$ is not held in $\tau$.


Memory Maps. A memory map of $\tau$ is a function from global variables to
thread indices $MMap_\tau\colon G \to [k]$ where for each variable $x \in G$, the map
$MMap_\tau(x)$ captures the thread of the active write of $x$ in $\tau$.


Witness States. The sequence $\tau$ is a witness prefix if the following hold.


(1) $\tau$ is a witness of $(\mathcal{E}(\tau), GoodW|R(\tau))$.
(2) For each $r \in X \setminus R(\tau)$ that holds its variable $var(r)$ in $\tau$, one of its
good-writes $w \in GoodW(r)$ is active in $\tau$.


Intuitively, $\tau$ is a witness prefix if it satisfies all VSC requirements modulo its
events, and if each read not in $\tau$ has at least one good-write still available to
read-from in potential extensions of $\tau$. For a witness prefix $\tau$ we call its corresponding
event set and memory map a witness state.

Figure 5 provides an example illustrating the above concepts, where for
brevity of presentation, the variables are subscripted and the values are not
displayed.


Fig. 5. Event set $X$, and the good-writes function $GoodW$ denoted by the green dotted
edges. The solid nodes are ordered vertically as they appear in $\tau$. The grey dashed
nodes are in $X \setminus \mathcal{E}(\tau)$. Events rz and w’, are executable in $\tau$. Event 7, is not, its
good-write is not active in $\tau$. Event Wy is also not executable, as its variable y is held by
ry. The memory map of $\tau$ is $MMap_\tau(x) = 1$ and $MMap_\tau(y) = 3$. $\tau$ is a witness prefix,
and $\mathcal{E}(\tau)$ with $MMap_\tau$ together form its witness state.

Algorithm 1: VerifySC


Input: Proper event set $X$ and good-writes function $GoodW\colon R(X) \to 2^{W(X)}$
Output: A witness $\tau$ of $(X, GoodW)$ if $(X, GoodW)$ has a witness, else $\tau = \bot$
1  $S \leftarrow \{\epsilon\}$; $Done \leftarrow \{\epsilon\}$
2  while $S \neq \emptyset$ do
3      Extract a sequence $\tau$ from $S$
4      if $\mathcal{E}(\tau) = X$ then return $\tau$                // All events executed, witness found
5      foreach event $e$ executable in $\tau$ do
6          Let $\tau_e \leftarrow \tau \circ e$                     // Execute e
7          if $\nexists \tau' \in Done$ s.t. $\mathcal{E}(\tau_e) = \mathcal{E}(\tau')$ and $MMap_{\tau_e} = MMap_{\tau'}$ then
8              Insert $\tau_e$ in $S$ and in $Done$      // New witness state reached
9  return $\bot$                                        // No witness exists


Algorithm. We are now ready to describe our algorithm VerifySC; Algorithm
1 presents the pseudocode. We attempt to construct a witness of $(X, GoodW)$
by enumerating the witness states reachable by the following process. We start
(Line 1) with an empty sequence $\epsilon$ as the first witness prefix (and state). We
maintain a worklist $S$ of so-far unprocessed witness prefixes, and a set $Done$
of reached witness states. Then we iteratively obtain new witness prefixes (and
states) by considering an already obtained prefix (Line 3) and extending it with
each possible executable event (Line 6). Crucially, when we arrive at a sequence
$\tau_e$, we include it only if no sequence $\tau'$ with equal corresponding witness state
has been reached yet (Line 7). We stop when we successfully create a witness
(Line 4) or when we process all reachable witness states (Line 9).
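
The search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the input is given as per-thread event lists in program order (so that a lower set is described by one position per thread), the `Event` tuple and the name `verify_sc` are hypothetical, and initial writes must be listed explicitly as events.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Event(NamedTuple):
    thread: int  # index of the thread in `threads`
    idx: int     # position inside the thread
    kind: str    # "w" or "r"
    var: str     # accessed global variable


def verify_sc(threads: List[List[Event]],
              good_w: Dict[Event, Set[Event]]) -> Optional[List[Event]]:
    """Return a witness of (X, GoodW), or None if there is none.

    `threads[i]` lists the events of thread i in program order, so X is the
    union of all lists; `good_w` maps every read to its set of good-writes.
    """
    k = len(threads)
    total = sum(len(t) for t in threads)
    start = (0,) * k
    # Worklist entries: (sequence, per-thread positions, active write per variable).
    worklist = [([], start, {})]
    # A witness state is the executed event set (given by the positions)
    # together with the memory map (variable -> thread of its active write).
    done = {(start, frozenset())}
    while worklist:
        tau, pos, active = worklist.pop()
        if len(tau) == total:
            return tau  # all events executed, witness found
        executed = set(tau)
        # Variables held by a pending read whose good-writes are all in tau.
        held = {e.var
                for t, p in zip(threads, pos) for e in t[p:]
                if e.kind == "r" and good_w[e] <= executed}
        for i in range(k):
            if pos[i] == len(threads[i]):
                continue
            e = threads[i][pos[i]]  # next event of thread i keeps a lower set
            if e.kind == "r":
                if active.get(e.var) not in good_w[e]:
                    continue  # no active good-write
                new_active = active
            else:
                if e.var in held:
                    continue  # the variable is held by some pending read
                new_active = {**active, e.var: e}
            new_pos = pos[:i] + (pos[i] + 1,) + pos[i + 1:]
            mmap = frozenset((v, w.thread) for v, w in new_active.items())
            if (new_pos, mmap) not in done:
                done.add((new_pos, mmap))
                worklist.append((tau + [e], new_pos, new_active))
    return None  # no witness exists


w0 = Event(0, 0, "w", "x")
w1 = Event(1, 0, "w", "x")
r2 = Event(2, 0, "r", "x")
witness = verify_sc([[w0], [w1], [r2]], {r2: {w0}})
```

The sketch uses the worklist as a stack; the deduplication on `(positions, memory map)` is what bounds the number of processed entries by the number of witness states.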


Correctness and Complexity. We now highlight the correctness and complexity
properties of VerifySC, while we refer to Appendix C of [30] for the proofs.
The soundness follows straightforwardly by the fact that each sequence in $S$ is a
witness prefix. This follows from a simple inductive argument that extending a
witness prefix with an executable event yields another witness prefix. The
completeness follows from the fact that given two witness prefixes $\tau_1$ and $\tau_2$ with
equal induced witness state, these prefixes are “equi-extendable” to a witness.
Indeed, if a suffix $\tau^*$ exists such that $\tau_1 \circ \tau^*$ is a witness of $(X, GoodW)$, then
$\tau_2 \circ \tau^*$ is also a witness of $(X, GoodW)$. The time complexity of VerifySC is
bounded by $O(n^{k+1} \cdot k^{d+1})$, for $n$ events, $k$ threads and $d$ variables. The bound
follows from the fact that there are at most $n^{k} \cdot k^{d}$ pairwise distinct witness
states. We thus have the following theorem.


Theorem 1. VSC for $n$ events, $k$ threads and $d$ variables is solvable in
$O(n^{k+1} \cdot k^{d+1})$ time. Moreover, if each variable is written by only one thread, VSC is
solvable in $O(n^{k+1})$ time.


Implications. We now highlight some important implications of Theorem 1.
Although VSC is NP-hard [25], the theorem shows that the problem is
parameterizable in $k + d$, and thus solvable in polynomial time when both parameters are
bounded. Moreover, even when only $k$ is bounded, the problem is fixed-parameter
tractable in $d$, meaning that $d$ only exponentiates a constant as opposed to $n$
(e.g., we have a polynomial bound even when $d = \log n$). Finally, the algorithm
is polynomial for a fixed number of threads regardless of $d$, when every memory
location is written by only one thread (e.g., in producer-consumer settings,
or in the concurrent-read-exclusive-write (CREW) concurrency model). These
important facts brought forward by Theorem 1 indicate that VSC is likely to be
efficiently solvable in many practical settings, which in turn makes RVF a good
equivalence for SMC.


4.2 Practical Heuristics for VerifySC in SMC 


We now turn our attention to some practical heuristics that are expected to 
further improve the performance of VerifySC in the context of SMC. 


1. Limiting the Search Space. We employ two straightforward improvements
to VerifySC that significantly reduce the search space in practice. Consider the
for-loop in Line 5 of Algorithm 1 enumerating the possible extensions of $\tau$. This
enumeration can be sidestepped by the following two greedy approaches.


(1) If there is a read $r$ executable in $\tau$, then extend $\tau$ with $r$ and do not
enumerate other options.

(2) Let $\bar{w}$ be an active write in $\tau$ such that $\bar{w}$ is not a good-write of any $r \in
R(X) \setminus \mathcal{E}(\tau)$. Let $w \in W(X) \setminus \mathcal{E}(\tau)$ be a write of the same variable ($var(w) =
var(\bar{w})$), note that $w$ is executable in $\tau$. If $w$ is also not a good-write of any
$r \in R(X) \setminus \mathcal{E}(\tau)$, then extend $\tau$ with $w$ and do not enumerate other options.


The enumeration of Line 5 then proceeds only if neither of the above two
techniques can be applied for $\tau$. This extension of VerifySC preserves completeness
(not only when used during SMC, but in general), and it can be significantly
faster in practice. For clarity of presentation we do not fully formalize
this extended version, as its worst-case complexity remains the same.


2. Closure. We introduce closure, a low-cost filter for early detection of VSC 
instances (X,GoodW) with no witness. The notion of closure, its beneficial prop- 
erties and construction algorithms are well-studied for the reads-from consistency 
verification problems [21,22,31], i.e., problems where a desired reads-from func- 
tion is provided as input instead of a desired good-writes function GoodW. Fur- 
ther, the work of [23] studies closure with respect to a good-writes function, but 
only for partial orders of Mazurkiewicz width 2 (i.e., for partial orders with no 
triplet of pairwise conflicting and pairwise unordered events). Here we define 
closure for all good-writes instances (X,GoodW), with the underlying partial 
order (in our case, the program order PO) of arbitrary Mazurkiewicz width. 

Given a VSC instance $(X, GoodW)$, its closure $P(X)$ is the weakest partial
order that refines the program order ($P \sqsubseteq PO|X$) and further satisfies the
following conditions. Given a read $r \in R(X)$, let $Cl(r) = GoodW(r) \cap VisibleW_P(r)$.
The following must hold.

(1) $Cl(r) \neq \emptyset$.
(2) If $(Cl(r), P|Cl(r))$ has a least element $w$, then $w <_P r$.
(3) If $(Cl(r), P|Cl(r))$ has a greatest element $w$, then for each $\bar{w} \in W(X) \setminus
GoodW(r)$ with $r \bowtie \bar{w}$, if $\bar{w} <_P r$ then $\bar{w} <_P w$.
(4) For each $\bar{w} \in W(X) \setminus GoodW(r)$ with $r \bowtie \bar{w}$, if each $w \in Cl(r)$ satisfies
$w <_P \bar{w}$, then we have $r <_P \bar{w}$.


If $(X, GoodW)$ has no closure (i.e., there is no $P$ with the above conditions),
then $(X, GoodW)$ provably has no witness. If $(X, GoodW)$ has closure $P$, then
each witness $\tau$ of $VSC(X, GoodW)$ provably refines $P$ (i.e., $\tau \sqsubseteq P$).

Finally, we explain how closure can be used by VerifySC. Given an input
$(X, GoodW)$, the closure procedure is carried out before VerifySC is called.
Once the closure $P$ of $(X, GoodW)$ is constructed, since each solution of
$VSC(X, GoodW)$ has to refine $P$, we restrict VerifySC to only consider sequences
refining $P$. This is ensured by an extra condition in Line 5 of Algorithm 1, where
we proceed with an event $e$ only if it is minimal in $P$ restricted to events not yet
in the sequence. This preserves completeness, while further reducing the search
space to consider for VerifySC.


3. VerifySC Guided by Auxiliary Trace. In our SMC approach, each time
we generate a VSC instance $(X, GoodW)$, we further have available an auxiliary
trace $\sigma$. In $\sigma$, either all-but-one, or all, good-writes conditions of $GoodW$ are
satisfied. If all good-writes conditions of $GoodW$ are satisfied, we already have $\sigma$ as a witness
of $(X, GoodW)$ and hence we do not need to run VerifySC at all. On the other
hand, if all-but-one are satisfied, we use $\sigma$ to guide the search of VerifySC,
as described below.

We guide the search by deciding the order in which we process the sequences
of the worklist $S$ in Algorithm 1. We use the auxiliary trace $\sigma$ with $\mathcal{E}(\sigma) = X$.
We use $S$ as a last-in-first-out stack; that way we search for a witness in a
depth-first fashion. Then, in Line 5 of Algorithm 1 we enumerate the extension events
in the reverse order of how they appear in $\sigma$. We enumerate in reverse order, as
each resulting extension is pushed into our worklist $S$, which is a stack
(last-in-first-out). As a result, in Line 3 of the subsequent iterations of the main while
loop, we pop extensions from $S$ in the order induced by $\sigma$.
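
The effect of the reverse enumeration can be seen in a few lines of Python. This is a simplified sketch under assumptions made here: the stack holds the extension events themselves rather than the extended sequences, and `push_in_trace_order` is a hypothetical helper name.

```python
from typing import Dict, Hashable, Iterable, List


def push_in_trace_order(stack: List[Hashable], extensions: Iterable[Hashable],
                        sigma: List[Hashable]) -> None:
    """Push `extensions` so that later pops follow their order in `sigma`."""
    position: Dict[Hashable, int] = {e: i for i, e in enumerate(sigma)}
    # Push in reverse trace order: the event that is earliest in sigma ends
    # up on top of the LIFO stack and is therefore popped first.
    for e in sorted(extensions, key=position.__getitem__, reverse=True):
        stack.append(e)


sigma = ["a", "b", "c", "d"]
stack: List[str] = []
push_in_trace_order(stack, {"c", "a", "d"}, sigma)
popped = [stack.pop() for _ in range(3)]
```

Popping three times yields the extensions in the order in which they occur in the auxiliary trace, so the depth-first search first tries the extension that agrees with that trace.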


5 Stateless Model Checking 


We are now ready to present our SMC algorithm RVF-SMC that uses RVF to 
model check a concurrent program. RVF-SMC is a sound and complete algorithm 
for local safety properties, i.e., it is guaranteed to discover all local states that 
each thread visits. 

RVF-SMC is a recursive algorithm. Each recursive call of RVF-SMC takes as
argument a tuple $(X, GoodW, \sigma, C)$ where:


(1) $X$ is a proper set of events.
(2) $GoodW\colon R(X) \to 2^{W(X)}$ is a desired good-writes function.
(3) $\sigma$ is a valid trace that is a witness of $(X, GoodW)$.
(4) $C\colon R \to Threads \to \mathbb{N}$ is a partial function called causal map that tracks
implicitly, for each read $r$, the writes that have already been considered as
reads-from sources of $r$.


Further, we maintain a function $ancestors\colon R(X) \to \{true, false\}$, where for each
read $r \in R(X)$, $ancestors(r)$ stores a boolean backtrack signal for $r$. We now
provide details on the notions of causal maps and backtrack signals.


Causal Maps. The causal map C serves to ensure that no more than one maximal
trace is explored per RVF partitioning class. Given a read r ∈ enabled(σ) that
is enabled in a trace σ, we define forbids^C_σ(r) as the set of writes in σ that
C forbids r to read from. Formally, forbids^C_σ(r) = ∅ if r ∉ dom(C), otherwise
forbids^C_σ(r) = {w ∈ W(σ) | w is within the first C(r)(thr(w)) events of σ_thr}.
We say that a trace σ satisfies C if for each r ∈ R(σ) we have
RF_σ(r) ∉ forbids^C_σ(r).
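
The definition of the forbidden writes can be made concrete with a small Go sketch. It is an illustration under stated assumptions rather than the data structure used in the tool: the types Event and CausalMap, the function forbids, and the sample trace are all hypothetical, and a trace is modeled simply as a sequence of events tagged with their thread.

```go
package main

import "fmt"

// Event is a hypothetical, simplified trace event.
type Event struct {
	ID      string
	Thread  int
	IsWrite bool
}

// CausalMap maps a read ID to per-thread prefix lengths. A read that is
// absent from the map is outside dom(C) and has no forbidden writes.
type CausalMap map[string]map[int]int

// forbids returns the IDs of the writes in trace that c forbids r to read
// from: the writes among the first c[r][thread] events of their own thread.
func forbids(trace []Event, c CausalMap, r string) []string {
	limits, ok := c[r]
	if !ok {
		return nil // r is not in dom(C)
	}
	seen := map[int]int{} // number of events seen so far, per thread
	var out []string
	for _, e := range trace {
		seen[e.Thread]++
		if e.IsWrite && seen[e.Thread] <= limits[e.Thread] {
			out = append(out, e.ID)
		}
	}
	return out
}

func main() {
	trace := []Event{
		{"w1", 1, true}, {"w3", 2, true}, {"w2", 1, true}, {"w4", 2, true},
	}
	// For read r: the first event of thread 1 and the first two of thread 2.
	c := CausalMap{"r": {1: 1, 2: 2}}
	fmt.Println(forbids(trace, c, "r"))
	fmt.Println(forbids(trace, c, "other"))
}
```

Storing only one number per thread keeps the map compact: the forbidden set is recovered from thread-local prefixes instead of being stored explicitly.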


Backtrack Signals. Each call of RVF-SMC (with its GoodW) operates with a
trace σ̃ satisfying GoodW that has only reads as enabled events. Consider one of
those enabled reads r ∈ enabled(σ̃). Each maximal trace satisfying GoodW shall
contain r, and further, one of the following two cases is true:


(1) In all maximal traces σ' satisfying GoodW, we have that r reads-from some
write of W(σ̃) in σ'.

(2) There exists a maximal trace σ' satisfying GoodW, such that r reads-from
a write not in W(σ̃) in σ'.


Whenever we can prove that the first case above is true for r, we can use this fact
to prune away some recursive calls of RVF-SMC while maintaining completeness.
Specifically, we leverage the following crucial lemma, whose proof is presented in
Appendix D of [30].


Lemma 1. Consider a call RVF-SMC(X, GoodW, σ, C) and a trace σ̃ extending
σ maximally such that no event of the extension is a read. Let r ∈ enabled(σ̃)
such that r ∉ dom(C). If there exists a trace σ' that (i) satisfies GoodW and C,
and (ii) contains r with RF_σ'(r) ∉ W(σ̃), then there exists a trace σ̄ that (i)
satisfies GoodW and C, (ii) contains r with RF_σ̄(r) ∈ W(σ̃), and (iii) contains
a write w ∉ W(σ̃) with r ⋈ w and thr(r) ≠ thr(w).


We then compute a boolean backtrack signal for a given RVF-SMC call and
read r ∈ enabled(σ̃) to capture satisfaction of the consequent of Lemma 1. If the
computed backtrack signal is false, we can safely stop the RVF-SMC exploration
of this specific call and backtrack to its recursion parent.
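
How a descendant call may raise the signal of an ancestor read can be sketched as follows. The sketch makes two loud assumptions: a read and a write are treated as conflicting exactly when they access the same variable, and the type Access, the function updateSignals, and the sample events are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// Access is a hypothetical summary of an event: its thread and its variable.
type Access struct {
	Thread int
	Var    string
}

// updateSignals sets the backtrack signal of every ancestor read for which a
// newly discovered write accesses the same variable from a different thread.
// Assumption: "same variable" stands in for the conflict relation.
func updateSignals(ancestors map[string]bool, reads map[string]Access, newWrites []Access) {
	for _, w := range newWrites {
		for r := range ancestors {
			a := reads[r]
			if a.Var == w.Var && a.Thread != w.Thread {
				ancestors[r] = true // potential new reads-from source for r
			}
		}
	}
}

func main() {
	ancestors := map[string]bool{"r1": false, "r2": false}
	reads := map[string]Access{"r1": {1, "x"}, "r2": {2, "y"}}
	// A write to x by thread 2 is discovered in a descendant call.
	updateSignals(ancestors, reads, []Access{{2, "x"}})
	fmt.Println(ancestors["r1"], ancestors["r2"])
}
```

A signal that is still false after all descendants have returned means that no new candidate source was found for that read.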


Algorithm. We are now ready to describe our algorithm RVF-SMC in detail.
Algorithm 2 captures the pseudocode of RVF-SMC(X, GoodW, σ, C). First, in
Line 1 we extend σ to σ̃ maximally such that no event of the extension is a
read. Then in Lines 2-5 we update the backtrack signals for ancestors of our
current recursion call. After this, in Lines 6-11 we construct a sequence of reads
enabled in σ̃. Finally, we proceed with the main while-loop in Line 13. In each
while-loop iteration we process an enabled read r (Line 14), and we perform
no more while-loop iterations in case we receive a false backtrack signal for r.

Algorithm 2: RVF-SMC(X, GoodW, σ, C)
Input: Proper set of events X, good-writes function GoodW, valid trace σ that
is a witness of (X, GoodW), causal map C.

 1 σ̃ ← σ ∘ σ̂ where σ̂ extends σ maximally such that no event of σ̂ is a read
 2 foreach w ∈ E(σ̂) do  // All extension events are writes
 3   foreach r ∈ dom(ancestors) do  // All ancestor mutations are reads
 4     if r ⋈ w and thr(r) ≠ thr(w) then  // Potential new source for r to read-from
 5       ancestors(r) ← true  // Set backtrack signal to true
 6 mutate ← ε  // Construct a sequence of enabled reads
 7 foreach r ∈ enabled(σ̃) do  // Enabled events in σ̃ are reads
 8   if r ∈ dom(C) then  // Causal map C is defined for r
 9     mutate ← mutate ∘ r  // Insert r to the end of mutate
10   else  // Causal map C is undefined for r
11     mutate ← r ∘ mutate  // Insert r to the beginning of mutate
12 backtrack ← true
13 while backtrack = true and mutate ≠ ε do
14   r ← pop front of mutate  // Process next read of mutate
15   if r ∉ dom(C) then
16     backtrack ← false
17   F_r ← VisibleW_σ̃(r) \ forbids^C_σ̃(r)  // Visible writes not forbidden by C
18   D_r ← {val_σ̃(w) : w ∈ F_r}  // The set of values that r may read
19   foreach v ∈ D_r do  // Process each value
20     X' ← X ∪ E(σ̂) ∪ {r}  // New event set
21     GoodW' ← GoodW ∪ {(r, {w ∈ F_r | val_σ̃(w) = v})}  // New good-writes
22     σ' ← VerifySC(X', GoodW')  // VerifySC guided by σ̃ ∘ r
23     if σ' ≠ ⊥ then  // (X', GoodW') has a witness
24       C' ← C
25       ancestors(r) ← backtrack  // Record ancestor
26       RVF-SMC(X', GoodW', σ', C')
27       backtrack ← ancestors(r)  // Retrieve backtrack signal
28       delete r from ancestors  // Unrecord ancestor
29   foreach thr ∈ Threads do  // Update causal map C(r) for each thread
30     C(r)(thr) ← |E(σ̃)_thr|  // Number of events of thr in σ̃


When processing r, first we collect its viable reads-from sources in Line 17, then
we group the sources by the value they write in Line 18, and then in iterations of
the for-loop in Line 19 we consider each value-group. In Line 20 we form the
event set, and in Line 21 we form the good-writes function that designates the
value-group as the good-writes of r. In Line 22 we use VerifySC to generate a
witness, and in case it exists, we recursively call RVF-SMC in Line 26 with the
newly obtained events, good-writes constraint for r, and witness.
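
The grouping of reads-from sources by written value (Lines 17-21) can be sketched in Go as follows. This is a minimal sketch, not the tool's code: the type Write, the function groupByValue, and the concrete values assigned to the writes are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Write is a hypothetical write event with the value it stores.
type Write struct {
	ID    string
	Value int
}

// groupByValue drops the forbidden writes from visible and groups the rest
// by written value. Each group is one candidate good-writes set for the read.
func groupByValue(visible []Write, forbidden map[string]bool) map[int][]string {
	groups := map[int][]string{}
	for _, w := range visible {
		if forbidden[w.ID] {
			continue // excluded by the causal map
		}
		groups[w.Value] = append(groups[w.Value], w.ID)
	}
	return groups
}

func main() {
	visible := []Write{{"w1", 1}, {"w3", 1}, {"w5", 2}}
	groups := groupByValue(visible, map[string]bool{})
	values := make([]int, 0, len(groups))
	for v := range groups {
		values = append(values, v)
	}
	sort.Ints(values) // deterministic order for printing
	for _, v := range values {
		fmt.Println(v, groups[v])
	}
}
```

One recursive call per value group, rather than per write, is what lets writes of equal value share a single branch of the exploration.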

To preserve completeness of RVF-SMC, the backtrack-signals technique can
be utilized only for reads r with undefined causal map, i.e., r ∉ dom(C) (cf.
Lemma 1). The order of the enabled reads imposed by Lines 6-11 ensures that
subsequently, in iterations of the loop in Line 13, we first consider all the
reads where we can utilize the backtrack signals. This is an insightful heuristic
that often helps in practice, though it does not improve the worst-case complexity.
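
The ordering imposed by Lines 6-11 can be sketched as below. The function orderReads and the read names are hypothetical, and membership in dom(C) is modeled by a plain boolean map, which is an assumption made only for this illustration.

```go
package main

import "fmt"

// orderReads builds the processing sequence of enabled reads: reads with a
// defined causal map go to the end, all others go to the front.
func orderReads(enabled []string, inDomC map[string]bool) []string {
	var mutate []string
	for _, r := range enabled {
		if inDomC[r] {
			mutate = append(mutate, r) // insert at the end
		} else {
			mutate = append([]string{r}, mutate...) // insert at the beginning
		}
	}
	return mutate
}

func main() {
	enabled := []string{"r1", "r2", "r3"}
	// Only r2 has a defined causal map in this example.
	fmt.Println(orderReads(enabled, map[string]bool{"r2": true}))
}
```

Here r3 and r1, for which backtrack signals apply, are processed before r2.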

Fig. 6. RVF-SMC (Algorithm 2). Circles represent nodes of the recursion tree. Below
each circle is its corresponding event set E(σ̃) and the enabled reads (dashed grey).
Writes with green background are good-writes (GoodW) of its corresponding-variable
read. Writes with red background are forbidden by C for its corresponding-variable
read. Dashed arrows represent recursive calls. (Color figure online)


Example. Figure 6 displays a simple concurrent program on the left, and
its corresponding RVF-SMC (Algorithm 2) run on the right. We start with
RVF-SMC(∅, ∅, ε, ∅) (A). By performing the extension (Line 1) we obtain the
events and enabled reads as shown below (A). First we process read r1 (Line
14). The read can read-from w1 and w3; both write the same value, so they are
grouped together as good-writes of r1. A witness is found and a recursive call to
(B) is performed. In (B), the only enabled event is r2. It can read-from w2 and
w4; both write the same value, so they are grouped for r2. A witness is found, a
recursive call to (C) is performed, and (C) concludes with a maximal trace.
Crucially, in (C) the event w5 is discovered, and since it is a potential new
reads-from source for r1, a backtrack signal is sent to (A). Hence after RVF-SMC
backtracks to (A), it needs to perform another iteration of the while-loop of
Line 13 in (A). In (A), first the causal map C is updated to forbid w1 and w3
for r1. Then, read r2 is processed from (A), creating (D). In (D), r1 is the only
enabled event, and w5 is its only C-allowed write. This results in (E) which
reports a maximal trace. The algorithm backtracks and concludes, reporting two
maximal traces in total.

Theorem 2. Consider a concurrent program H of k threads and d variables,
with n the length of the longest trace in H. RVF-SMC is a sound and complete
algorithm for local safety properties in H. The time complexity of RVF-SMC is
k^d · n^{O(k)} · B, where B is the size of the RVF trace partitioning of H.


Novelties of the Exploration. Here we highlight some key aspects of 
RVF-SMC. First, we note that RVF-SMC constructs the traces incrementally 
with each recursion step, as opposed to other approaches such as [15,22] that 
always work with maximal traces. The reason for incremental traces is technical
and has to do with the value-based treatment of the RVF partitioning. We
note that the other two value-based approaches [23,24] also operate with incre- 
mental traces. However, RVF-SMC brings certain novelties compared to these 
two methods. First, the exploration algorithm of [24] can visit the same class 
of the partitioning (and even the same trace) an exponential number of times 
by different recursion branches, leading to significant performance degradation. 
The exploration algorithm of [23] alleviates this issue using the causal map data 
structure, similar to our algorithm. The causal map data structure can provably 
limit the number of revisits to polynomial (for a fixed number of threads), and 
although it offers an improvement over the exponential revisits, it can still affect 
performance. To further improve performance in this work, our algorithm com- 
bines causal maps with a new technique, which is the backtrack signals. Causal 
maps and backtrack signals together are very effective in avoiding having differ- 
ent branches of the recursion visit the same RVF class. 


Beyond RVF Partitioning. While RVF-SMC explores the RVF partitioning in
the worst case, in practice it often operates on a partitioning coarser than the one
induced by the RVF equivalence. Specifically, RVF-SMC may treat two traces σ1
and σ2 with the same events (E(σ1) = E(σ2)) and value function (val_σ1 = val_σ2)
as equivalent even when they differ in some causal orderings
(↦_σ1|R ≠ ↦_σ2|R). To see an example of this, consider the program and the
RVF-SMC run in Fig. 6. The recursion node (C) spans all traces where (i) r1
reads-from either w1 or w3, and (ii) r2 reads-from either w2 or w4. Consider two
such traces σ1 and σ2, with RF_σ1(r2) = w2 and RF_σ2(r2) = w4. We have
r1 ↦_σ1 r2 but not r1 ↦_σ2 r2, and yet σ1 and σ2 are (soundly) considered
equivalent by RVF-SMC. Hence the RVF partitioning is used to upper-bound the
time complexity of RVF-SMC. We remark that the algorithm is always sound,
i.e., it is guaranteed to discover all thread states even when it does not
explore the RVF partitioning in full.


6 Experiments 


In this section we describe the experimental evaluation of our SMC approach 
RVF-SMC. We have implemented RVF-SMC as an extension in Nidhugg [27], a 
state-of-the-art stateless model checker for multithreaded C/C++ programs that 
operates on LLVM Intermediate Representation. First we assess the advantages 
of utilizing the RVF equivalence in SMC as compared to other trace equivalences. 
Then we perform ablation studies to demonstrate the impact of the backtrack
signals technique (cf. Sect. 5) and the VerifySC heuristics (cf. Sect. 4.2).

In our experiments we compare RVF-SMC with several state-of-the-art SMC 
tools utilizing different trace equivalences. First we consider VC-DPOR [23], the 
SMC approach operating on the value-centric equivalence. Then we consider 
Nidhugg/rfsc [22], the SMC algorithm utilizing the reads-from equivalence. Fur- 
ther we consider DC-DPOR [21] that operates on the data-centric equivalence, 
and finally we compare with Nidhugg/source [15] utilizing the Mazurkiewicz
equivalence.¹ The works of [22] and [32] in turn compare the Nidhugg/rfsc
algorithm with additional SMC tools, namely GenMC [28] (with reads-from
equivalence), RCMC [29] (with Mazurkiewicz equivalence), and CDSChecker [33]
(with Mazurkiewicz equivalence), and thus we omit those tools from our evaluation.

There are two main objectives to our evaluation. First, from Sect. 3 we know
that the RVF equivalence can be up to exponentially coarser than the other
equivalences, and we want to discover how often this happens in practice. Second,
in cases where RVF does provide reduction in the trace-partitioning size, we aim
to see whether this reduction is accompanied by a reduction in the runtime of
RVF-SMC operating on RVF equivalence.

Setup. We consider 119 benchmarks in total in our evaluation. Each benchmark
comes with a scaling parameter, called the unroll bound. The parameter controls 
the bound on the number of iterations in all loops of the benchmark. For each 
benchmark and unroll bound, we capture the number of explored maximal traces, 
and the total running time, subject to a timeout of one hour. In Appendix E 
of [30] we provide further details on our setup. 

Fig. 7. Runtime and traces comparison of RVF-SMC with VC-DPOR.


1 The MCR algorithm [24] is beyond the experimental scope of this work, as that tool 
handles Java programs and uses heavyweight SMT solvers that require fine-tuning. 

Fig. 10. Runtime and traces comparison of RVF-SMC with Nidhugg/source.


Results. We provide a number of scatter plots summarizing the comparison of
RVF-SMC with other state-of-the-art tools. In Fig. 7, Fig. 8, Fig. 9 and Fig. 10
we provide comparison both in runtimes and explored traces, for VC-DPOR,
Nidhugg/rfsc, DC-DPOR, and Nidhugg/source, respectively. In each scatter plot,
both its axes are log-scaled, the opaque red line represents equality, and the two
semi-transparent lines represent an order-of-magnitude difference. The points
are colored green when RVF-SMC achieves trace reduction in the underlying
benchmark, and blue otherwise.


Discussion: Significant Trace Reduction. In Table 1 we provide the results
for several benchmarks where RVF achieves significant reduction in the
trace-partitioning size. This is typically accompanied by significant runtime
reduction, allowing us to scale the benchmarks to unroll bounds that other tools
cannot handle. Examples of this are 27_Boop4 and scull_loop, two toy Linux kernel
drivers.

In several benchmarks the number of explored traces remains the same for 
RVF-SMC even when scaling up the unroll bound, see 45_monabsex1, reorder_5 
and singleton in Table 1. The singleton example is further interesting, in that while 
VC-DPOR and DC-DPOR also explore few traces, they still suffer in runtime 
due to additional redundant exploration, as described in Sects. 1 and 5. 


Table 1. Benchmarks with trace reduction achieved by RVF-SMC. The unroll bound 
is shown in the column U. Symbol “—” indicates one-hour timeout. Bold-font entries 
indicate the smallest numbers for respective benchmark and unroll. 


Benchmark                  |        |  U | RVF-SMC | VC-DPOR | Nidh/rfsc | DC-DPOR | Nidh/source
27_Boop4 (threads: 4)      | Traces | 10 | 1337215 | 1574287 | 11610040  | —       | —
                           |        | 12 | 2893039 | —       | —         | —       | —
                           | Times  | 10 | 837 s   | 1946 s  | 2616 s    | —       | —
                           |        | 12 | 2017 s  | —       | —         | —       | —
45_monabsex1 (threads: U)  | Traces |  7 | 1       | 423360  | 262144    | 7073803 | 25401600
                           |        |  8 | 1       | —       | 4782969   | —       | —
                           | Times  |  7 | 0.09 s  | 784 s   | 33 s      | 3239 s  | 2819 s
                           |        |  8 | 0.09 s  | —       | 677 s     | —       | —
reorder_5 (threads: U+1)   | Traces |  9 | 4       | 1644716 | 1540      | 1792290 | —
                           |        | 30 | 4       | —       | 54901     | —       | —
                           | Times  |  9 | 0.10 s  | 1711 s  | 0.44 s    | 974 s   | —
                           |        | 30 | 0.09 s  | —       | 49 s      | —       | —
scull_loop (threads: 3)    | Traces |  2 | 3908    | 15394   | 749811    | 884443  | 3157281
                           |        |  3 | 115032  | —       | —         | —       | —
                           | Times  |  2 | 6.55 s  | 83 s    | 403 s     | 1659 s  | 1116 s
                           |        |  3 | 266 s   | —       | —         | —       | —
singleton (threads: U+1)   | Traces | 20 | 2       | 2       | 20        | 20      | —
                           |        | 30 | 2       | —       | 30        | —       | —
                           | Times  | 20 | 0.07 s  | 179 s   | 0.08 s    | 171 s   | —
                           |        | 30 | 0.08 s  | —       | 0.10 s    | —       | —

Table 2. Benchmarks with little-to-no trace reduction by RVF-SMC. Symbol † indicates
that a particular benchmark operation is not handled by the tool.


Benchmark                     |        | U | RVF-SMC | VC-DPOR | Nidh/rfsc | DC-DPOR | Nidh/source
13_unverif (threads: U)       | Traces | 5 | 14400   | 14400   | 14400     | 14400   | 14400
                              |        | 6 | 518400  | —       | 518400    | —       | 518400
                              | Times  | 5 | 7.45 s  | 63 s    | 3.33 s    | 68 s    | 2.72 s
                              |        | 6 | 376 s   | —       | 134 s     | —       | 84 s
approxds_append (threads: U)  | Traces | 6 | 50897   | 1256381 | 198936    | 1114746 | 9847080
                              |        | 7 | 923526  | —       | 4645207   | —       | —
                              | Times  | 6 | 60 s    | 995 s   | 67 s      | 944 s   | 2733 s
                              |        | 7 | 2078 s  | —       | 2003 s    | —       | —
chase-lev-dq (threads: 3)     | Traces | 4 | 87807   | †       | 175331    | †       | 175331
                              |        | 5 | 227654  | †       | 448905    | †       | 448905
                              | Times  | 4 | 289 s   | †       | 71 s      | †       | 71 s
                              |        | 5 | 995 s   | †       | 210 s     | †       | 200 s
linuxrwlocks (threads: U+1)   | Traces | 1 | 56      | †       | 59        | †       | 59
                              |        | 2 | 62018   | †       | 70026     | †       | 70026
                              | Times  | 1 | 0.12 s  | †       | 0.09 s    | †       | 0.13 s
                              |        | 2 | 42 s    | †       | 15 s      | †       | 9.50 s
pgsql (threads: 2)            | Traces | 3 | 3906    | 3906    | 3906      | 3906    | 3906
                              |        | 4 | 335923  | 335923  | 335923    | 335923  | 335923
                              | Times  | 3 | 3.30 s  | 5.98 s  | 1.01 s    | 4.00 s  | 0.51 s
                              |        | 4 | 412 s   | 911 s   | 107 s     | 616 s   | 51 s


Discussion: Little-to-no Trace Reduction. Table 2 presents several bench- 
marks where the RVF partitioning achieves little-to-no reduction. In these cases 
the well-engineered Nidhugg/rfsc and Nidhugg/source dominate the runtime. 


RVF-SMC Ablation Studies. Here we demonstrate the effect that fol- 
lows from our RVF-SMC algorithm utilizing the approach of backtrack signals 
(see Sect. 5) and the heuristics of VerifySC (see Sect. 4.2). These techniques have 
no effect on the number of the explored traces, thus we focus on the runtime. 
The left plot of Fig. 11 compares RVF-SMC as is with an RVF-SMC version that
does not utilize the backtrack signals (achieved by simply keeping the backtrack
flag in Algorithm 2 always true). The right plot of Fig. 11 compares RVF-SMC
as is with an RVF-SMC version that employs VerifySC without the closure and
auxiliary-trace heuristics. We can see that the techniques almost always result 
in improved runtime. The improvement is mostly within an order of magnitude, 
and in a few cases there is several-orders-of-magnitude improvement. 

Finally, in Fig. 12 we illustrate how much time during RVF-SMC is typically 
spent on VerifySC (i.e., on solving VSC instances generated during RVF-SMC). 

Fig. 11. Ablation studies of RVF-SMC. The left plot compares RVF-SMC with and
without backtrack signals. The right plot compares RVF-SMC with and without the
closure and auxiliary-trace heuristics of Sect. 4.2.

Fig. 12. Histogram that illustrates the percentage of time spent solving VSC instances
during RVF-SMC.


7 Conclusions 


In this work we developed RVF-SMC, a new SMC algorithm for the verification 
of concurrent programs using a novel equivalence called reads-value-from (RVF). 
On our way to RVF-SMC, we have revisited the famous VSC problem [25]. 
Despite its NP-hardness, we have shown that the problem is parameterizable in 
k+d (for k threads and d variables), and becomes even fixed-parameter tractable 
in d when k is constant. Moreover we have developed practical heuristics that 
solve the problem efficiently in many practical settings. 

Our RVF-SMC algorithm couples our solution for VSC to a novel exploration
of the underlying RVF partitioning, and is able to model check many
concurrent programs where previous approaches time out. Our experimental
evaluation reveals that RVF is very often the most effective equivalence, as the
underlying partitioning can be exponentially coarser than those of other
equivalences. Moreover, RVF-SMC generates representatives very efficiently, as
the reduction in the partitioning is often met with significant speed-ups in the
model checking task. Interesting future work includes further improvements to
solving VSC, as well as extensions of RVF-SMC to relaxed memory models.


Abstract. Go is an increasingly popular systems programming language
targeting, especially, concurrent and distributed systems. Go differentiates
itself from other imperative languages by offering structural subtyping and
lightweight concurrency through goroutines with message-passing
communication. This combination of features poses interesting
challenges for static verification, most prominently the combination of a 
mutable heap and advanced concurrency primitives. 

We present Gobra, a modular, deductive program verifier for Go 
that proves memory safety, crash safety, data-race freedom, and user- 
provided specifications. Gobra is based on separation logic and supports 
a large subset of Go. Its implementation translates an annotated Go 
program into the Viper intermediate verification language and uses an 
existing SMT-based verification backend to compute and discharge proof 
obligations. 


Keywords: Separation logic - Program logics - Channel-based 
concurrency - Interfaces - Deductive verification - Automated 
verification 


1 Introduction 


Go is an increasingly popular systems programming language targeting, espe- 
cially, concurrent and distributed systems such as web applications. It combines 
standard features of imperative languages, such as mutable heap data struc- 
tures, with less common concepts, such as structural subtyping and lightweight 
concurrency through goroutines with message-passing communication. 

This combination of features poses interesting challenges for static verifica- 
tion, most prominently the combination of a mutable heap and advanced concur- 
rency primitives. Prior research on Go verification handles some of these features, 
but not their combination. For instance, Lange et al. [14,15] verify safety and 
liveness of Go’s message-passing, but do not consider functional properties about
the heap state, whereas Perennial [4] supports heap data structures, but neither 
channels nor interfaces. 

We present Gobra, an automated, modular verifier for heap-manipulating, 
concurrent Go programs. Gobra supports a large subset of Go, including Go’s 
interfaces and primitive data structures, both of which have not been fully sup- 
ported in previous work. Gobra verifies memory safety, crash safety, data-race 
freedom, and user-provided specifications. It takes as input a Go program
annotated with assertions such as pre- and postconditions and loop invariants.
Verification proceeds by encoding the annotated programs into the intermediate
verification language Viper [17] and then applying an existing SMT-based veri- 
fier. In case verification fails, Gobra reports at the level of the Go program which 
assertions it could not verify. 

Gobra’s assertion language builds on established concepts: Gobra uses sepa- 
ration logic style permissions [19] to reason locally about heap data structures. 
It supports recursive predicates and specification methods to abstract over (pos- 
sibly unbounded) data structures and their contents. In particular, Gobra has 
first-class predicates that enable a natural specification of concurrency primitives, 
for instance, to parameterize a lock by an invariant. 

Gobra is intended for the verification of substantial, real-world code, and is 
currently used to verify the Go implementation of the SCION internet architec- 
ture [23]. Our tool paper makes the following technical contributions: 


(1) We present the Gobra tool, an automated modular verifier for annotated Go 
programs. Our evaluation demonstrates that Gobra can verify non-trivial 
examples with good performance. Our artifact is available online [21]. 

(2) We define a specification language for functional properties of Go programs. 
Our specification language provides a consistent abstraction at the level of 
Go and does not leak details of the underlying encoding. 

(3) We present the first specification and verification technique for structural 
subtyping via Go interfaces. 

(4) Our Viper encoding supports, among other features, Go’s broad range of 
built-in data types, such as slices and channels. A lightweight annotation 
allows it to apply separation logic to reason soundly about addressable 
memory locations, but use a more efficient encoding for others. 


Outline. We demonstrate key features of Gobra on examples (Sect. 2), give an 
overview of the encoding into Viper (Sect. 3), and provide an experimental eval- 
uation of Gobra (Sect. 4). Lastly, Sect. 5 discusses related work and concludes. 


2 Gobra in a Nutshell 


This section illustrates Gobra’s specification language on simple examples and 
shows how we handle interfaces and concurrency. 


2.1 Basics


Gobra uses a variant of separation logic [19] in order to reason about muta- 
ble heap data structures and concurrency. Separation logics associate an access 
permission with each heap location. Access permissions are held by method 
executions and transferred between methods upon call and return. A method 
may access a location only if it holds the associated permission. Permission to a 
shared location v is denoted in Gobra by acc(&v), which is analogous to
separation logic’s v ↦ _. Gobra provides an expressive permission model supporting
fractional permissions [3] to allow concurrent read accesses while still ensuring 
exclusive writes, (recursive) predicates to denote access to unbounded data struc- 
tures, and quantified permissions (also called iterated separating conjunction) to 
express permissions to random-access data structures such as arrays and slices. 


    requires forall k int :: 0 <= k && k < len(s) ==> acc(&s[k])
    ensures  forall k int :: 0 <= k && k < len(s) ==> acc(&s[k])
    ensures  forall k int :: 0 <= k && k < len(s) ==> s[k] == old(s[k]) + n
    func incr(s []int, n int) {
        invariant 0 <= i && i <= len(s)
        invariant forall k int :: 0 <= k && k < len(s) ==> acc(&s[k])
        invariant forall k int :: i <= k && k < len(s) ==> s[k] == old(s[k])
        invariant forall k int :: 0 <= k && k < i ==> s[k] == old(s[k]) + n
        for i := 0; i < len(s); i += 1 {
            s[i] = s[i] + n
        }
    }


Fig. 1. A simple Gobra example showing method and loop contracts. 


The example in Fig. 1 illustrates the use of permissions. Method incr
increases all elements of a given slice s by an amount n. (Slices are data types 
that can intuitively be seen as shared arrays of variable length.) The method 
requires permission to all slice elements (via its precondition) and returns them 
to the caller (via its first postcondition). 

Functional properties are expressed via standard assertions, which include 
side-effect free Go expressions (including calls to pure methods, as we explain 
below) as well as universal quantification and old-expressions to refer to the value 
an expression had in the pre-state of a method. In our example, the second 
postcondition uses these assertions to express the functional behavior of the 
method. The loop invariants are analogous to the method contracts and are 
needed for verification. 

In Go, any memory location can either be shared or exclusive. Shared loca- 
tions reside on the heap and can, thus, be accessed by multiple methods and 
threads; reasoning about shared locations requires permissions to ensure race 
freedom and to enable framing, i.e., preserving information across heap changes. 
On the other hand, exclusive locations are accessed exclusively by one method 
execution and may be allocated on the stack; they can be reasoned about as 
local variables. The Go compiler determines automatically whether a location is 
shared or exclusive, for instance by determining whether its address is taken at
some point of the execution. To make verification independent of a particular 
compiler analysis, Gobra requires shared locations to be decorated with an extra 
annotation @ at the declaration point, as illustrated by the following client of
incr:


1 a@ := [4]int{1, 2, 4, 8}
2 incr(a[2:], 2)
3 assert a == [4]int{1, 2, 6, 10}


The first line declares a Go array a of fixed length 4, with values 1, 2, 4, and 8. 
This array is sliced on line 2 using the syntax a[2:], thereby omitting the first 
two elements of a from the created slice. Since a is used in a context in which 
it is sliced, it is a shared location, which is made explicit via the @ annotation. 
Consequently, the array creation will produce permissions to the array elements, 
which are required by incr’s precondition. Omitting the @ annotation will cause 
a verification error. 
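
The run-time behavior that the client relies on can be reproduced in plain Go. The sketch below is only an executable counterpart without Gobra annotations: it checks the final array contents dynamically for this one input, whereas Gobra establishes the assertion statically for all executions.

```go
package main

import "fmt"

// incr adds n to every element of s, mirroring the method of Fig. 1
// without its specification.
func incr(s []int, n int) {
	for i := 0; i < len(s); i += 1 {
		s[i] = s[i] + n
	}
}

func main() {
	a := [4]int{1, 2, 4, 8}
	incr(a[2:], 2) // the slice shares memory with a, so a itself changes
	want := [4]int{1, 2, 6, 10}
	fmt.Println(a == want)
}
```

Since a[2:] aliases the last two elements of the array, the update made through the slice is visible in a, which is why the array has to be treated as a shared location.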


2.2 Interfaces 


Go supports polymorphism through interfaces, named sets of method signatures. 
Subtyping for interfaces is structural: a type implements an interface iff every 
method of the interface is implemented by the type. The subtype relationship is 
determined by the type checker, without any declarations from the programmer¹.

Calls on an interface value are dynamically dispatched. In settings with nomi- 
nal subtyping, dynamic dispatch is handled by proving behavioral subtyping [16]: 
each subtype declaration requires a proof that the specifications of subtype meth- 
ods refine the specifications of the corresponding supertype methods. Since struc- 
tural subtypes are not declared explicitly, we adapt this approach as follows. 

Whenever a Go program assigns a value to a variable of an interface type, 
Gobra requires an implementation proof, that is, a proof that each method of the 
subtype satisfies the specification of the corresponding method in the interface. 
Implementation proofs are inferred automatically by Gobra in simple cases; user- 
provided implementation proofs are required especially when they include ghost 
operations, for instance, to manipulate predicates. 

The example in Fig. 2 illustrates this approach. Interface stream
(lines 1-8) declares an interface with two methods, hasNext and next. The 
latter may return values of an arbitrary type, which is denoted by an empty 
interface. Since interfaces do not contain an implementation, their specification 
must be fully abstract. To this end, stream introduces an abstract predicate 
memory, whose definition is provided by the subtypes of the interface. The func- 
tional behavior of interface methods can be expressed in terms of pure (that is, 
side-effect free) abstract methods, here, hasNext, which will also be defined in 
subtypes. 

Next, lines 10-16 show an implementation of the interface in the form of a
counter. The counter has a current f and maximum max value. As long as the
maximum value is not reached, next will increase the current value. At line 16,
an integer can be assigned to the empty interface since behavioral subtyping
holds trivially. The specification at line 15 expresses that the returned interface
value contains an integer with the old value of the f field.


 1 type stream interface {
 2   pred memory()
 3   requires acc(memory(), _) // arbitrary fraction of memory()
 4   pure hasNext() bool
 5   requires memory() && hasNext()
 6   ensures memory()
 7   next() interface{}
 8 }
 9
10 type counter struct{ f int; max int }
11 requires acc(&x.f, _) && acc(&x.max, _)
12 pure func (x *counter) hasNext() bool { return x.f < x.max }
13 requires acc(&x.f) && acc(&x.max, 1/2) && x.hasNext()
14 ensures acc(&x.f) && acc(&x.max, 1/2) && x.f == old(x.f)+1
15 ensures typeOf(y) == int && y.(int) == old(x.f)
16 func (x *counter) next() (y interface{}) { x.f++; return x.f-1 }
17
18 pred (x *counter) memory() { acc(&x.f) && acc(&x.max) }
19 (*counter) implements stream {
20   pure (x *counter) hasNext_proof() bool {
21     return unfolding acc(x.memory(), _) in x.hasNext()
22   }
23   (x *counter) next_proof() (res interface{}) { ... }
24 }


Fig. 2. An interface specification for a stream (lines 1-8) together with an implementa-
tion (lines 10-16) and an implementation proof (lines 18-24). We write acc(p, _) to
denote an arbitrary, positive amount of predicate p, and simply p for acc(p, 1/1).
At line 14, the fractional permission to &x.max entails that x.max is not modified.


1 For the sake of simplicity, we omit embeddings, Go’s construct for delegation; an
extension is straightforward.

The counter implementation is completely independent of the stream inter- 
face. Their connection is established only in the implementation proof (lines 
18-24). This proof defines the memory predicate from the stream interface for 
receivers of type counter (line 18). Moreover, an implementation proof verifies 
that the specification of each method implementation refines the specification 
of the corresponding interface method. This proof checks that, assuming the 
precondition of an interface method, a call to the implementation method with 
identical arguments establishes the postcondition of the interface method. This 
format is enforced syntactically and permits ghost operations before and after 
the call to manipulate predicates. For instance, the proof on line 21 for hasNext 
temporarily unfolds the memory predicate to obtain permission to x, which is 
required by the implementation method, and conversely after the call. 

Implementation proofs can be written explicitly, imported from other pack- 
ages, and also inferred automatically when no explicit proof exists in the current 
scope. Currently, Gobra does not infer ghost operations such as the unfolding
on line 21; however, our experiments suggest that even simple heuristics can deal with
many cases occurring in practice. For instance, many implementation proofs we
have encountered follow the same pattern: First, the interface predicate instances 
of the precondition are unfolded. Second, the implementation method is called. 
Lastly, the interface predicate instances of the postcondition are folded. This 
pattern can be generated automatically to alleviate the annotation burden. 

Gobra’s implementation proofs enable one to reason about interfaces without 
enforcing subtype declarations in either the interface or the declaration, which 
would defeat the purpose of structural subtyping. This solution allows one to rea- 
son about dynamically-dispatched calls. For instance, the following code snippet 
verifies in Gobra: 


    x := &counter{0, 50}
    var y stream = x
    fold y.memory()
    var z interface{} = y.next()


In particular, Gobra is able to determine that next’s precondition
hasNext() holds because y.hasNext() is equal to x.hasNext(), and the latter
follows from the definition of hasNext (line 12) and the initial value of x.f.
This intuitive reasoning is enabled by an intricate underlying encoding, which 
is not exposed to users. Users do not have to know how interface predicates are 
encoded and can treat interface predicates the same as any other separation-logic 
predicate. 
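For comparison, the same stream and counter can be written as plain, unannotated Go. The sketch below assumes the stream interface declares exactly the two methods hasNext and next (the interface listing itself is not reproduced here); the helper drain is hypothetical and only serves to exercise the dynamically-dispatched calls.

```go
package main

import "fmt"

// stream mirrors the interface of Fig. 2 without Gobra annotations.
// Assumption: it declares exactly these two methods.
type stream interface {
	hasNext() bool
	next() interface{}
}

// counter never mentions stream. Go's structural subtyping makes
// *counter a stream because it provides both methods.
type counter struct{ f, max int }

func (x *counter) hasNext() bool { return x.f < x.max }

// next increments f and returns its old value boxed in an empty interface.
func (x *counter) next() interface{} { x.f++; return x.f - 1 }

// drain (hypothetical helper) calls next through the interface
// until the stream is exhausted.
func drain(s stream) []int {
	var out []int
	for s.hasNext() {
		out = append(out, s.next().(int))
	}
	return out
}

func main() {
	x := &counter{0, 3}
	var y stream = x // implicit conversion, no subtype declaration needed
	fmt.Println(drain(y))
}
```

The type assertion s.next().(int) corresponds to the postcondition on line 15 of Fig. 2: without a specification, Go can only check at run time that the returned interface value holds an int.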


2.3 Concurrency 


Go supports concurrency through goroutines, lightweight threads started by pre- 
fixing a method call with the go keyword. Go offers the usual synchronization 
primitives, but goroutines idiomatically synchronize via channels. Buffered chan- 
nels provide asynchronous communication, where sending a message blocks only 
when the buffer is full. Unbuffered channels offer rendezvous communication.
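The difference between the two channel kinds can be observed with a non-blocking send. The following is a minimal sketch; trySend is a hypothetical helper that uses select with a default case to report whether a plain send would have blocked.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
// A false result means that a plain "c <- v" would have blocked.
func trySend(c chan int, v int) bool {
	select {
	case c <- v:
		return true
	default:
		return false
	}
}

func main() {
	buffered := make(chan int, 2)
	// The first two sends fit into the buffer, the third would block.
	fmt.Println(trySend(buffered, 1), trySend(buffered, 2), trySend(buffered, 3))

	unbuffered := make(chan int)
	// No receiver is waiting, so a rendezvous is not possible.
	fmt.Println(trySend(unbuffered, 1))
}
```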

Gobra enables verification of concurrent programs by associating Go’s syn- 
chronization primitives with predicates that do not only express properties of 
data but also express how permissions to shared memory get transferred between 
threads. For instance, lock invariants may include properties as well as permis- 
sions to the data protected by the lock, and channel invariants include properties 
and permissions of the data sent over a channel. These invariants are specified 
via ghost operations when the synchronization primitive is initialized. 

Figure 3 illustrates Gobra’s concurrency support using an excerpt from a
parallel search-and-replace algorithm (see the full paper [22] for the complete 
example). Method searchAndReplace spawns a series of worker threads and 
then sends each of them a chunk of the input slice to process. The worker threads 
are joined via a wait group wg. Method worker implements the worker threads. 

Gobra associates channels (like c in the example) with a predicate to specify 
properties and permissions of the sent data. The call c.Init(...) on line 10 
takes this predicate as an argument. As expressed on line 2, it includes permis- 
sions to the chunk a worker operates on. For synchronous channels, an additional 
predicate can specify permissions transferred in the opposite direction, from the 
 1 pred messagePerm(wg *sync.WaitGroup, chunk []int, x, y int) {
 2   (∀ i int :: 0 ≤ i < len(chunk) ⇒ acc(&chunk[i])) && ...
 3 }
 4 requires ∀ i int :: 0 ≤ i < len(s) ⇒ acc(&s[i])
 5 func searchAndReplace(s []int, x, y int) {
 6   var wg@ sync.WaitGroup
 7   ghost wg.Init()
 8   c := make(chan []int, 4)
 9   // predicate-name{..., _, ...} is syntax for partial application
10   ghost c.Init(messagePerm{&wg, _, x, y})
11   // Spawn workers
12   invariant acc(c.RecvChannel(), _)
13   invariant c.RecvGotPerm() == messagePerm{&wg, _, x, y}
14   for i := 0; i < numOfWorkers; i++ { go worker(c, wg, x, y) }
15   // Split slice into chunks, which are sent to workers
16   invariant c.SendChannel()
17   invariant c.SendGivenPerm() == messagePerm{&wg, _, x, y}
18   invariant ∀ i int :: offset ≤ i < len(s) ⇒ acc(&s[i])
19   invariant ... // constraints on offset and nextOffset
20   for offset := 0; offset != len(s); offset = nextOffset {
21     nextOffset = ...
22     wg.Add(1)
23     fold messagePerm{&wg, _, x, y}(s[offset:nextOffset])
24     c <- s[offset:nextOffset]
25   }
26   wg.Wait()
27 }
28 requires acc(c.RecvChannel(), _)
29 requires c.RecvGotPerm() == messagePerm{wg, _, x, y}
30 func worker(c <-chan []int, wg *sync.WaitGroup, x, y int) {
31   invariant acc(c.RecvChannel(), _)
32   invariant c.RecvGotPerm() == messagePerm{wg, _, x, y}
33   invariant ok ⇒ messagePerm{wg, _, x, y}(chunk)
34   for chunk, ok := <-c; ok; chunk, ok = <-c {
35     unfold messagePerm{wg, _, x, y}(chunk)
36     ... // replace x with y in chunk
37     wg.Done() // same as wg.Add(-1)
38   }
39 }


Fig. 3. Excerpt showing goroutines, channels, and wait groups. The code spawns 
workers (line 14), sends slice chunks through a channel to the workers (line 24), and 
then waits on a wait group (line 26). A worker receives a chunk (line 34), processes it, 
and then notifies the wait group (line 37). For the sake of simplicity, some details were 
omitted. 


receiver to the sender. Initializing a channel also creates send and receive per- 
missions for the channel, which are used to control which threads may access it. 
In our example, we transfer a fraction of the receive permission to each worker 
(line 28). 

The workers receive permission to the chunk they operate on via a message 
sent on line 24 and received on line 34. The transfer back is orchestrated through 
a wait group, which implements an abstract shared counter. Wait groups are used 
as follows: The main thread adds to the counter the number of units of work 
to be done in spawned goroutines (line 22). Each spawned goroutine decreases 
the counter each time a unit of work is done (via a call to Done, line 37). The 
main thread can wait for the counter to reach 0 via a call to Wait (line 26). Gobra uses
dedicated permissions to express the obligation of a thread to perform units
of work before decreasing the counter; each time this happens, permissions are
transferred to the wait group and, eventually, to the main thread calling Wait.
We omit the details here for brevity. 

In our example, this mechanism allows the main thread to recover the permis- 
sions to the entire slice once the workers have terminated. The example in Fig. 3 
illustrates only the permission aspect of the verification. Functional correctness 
can be verified easily based on the explained machinery, by specifying a stronger 
channel invariant that includes the work obligation for each worker. We omit the 
details here, but see the full paper [22] for the complete example. 
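To make the control flow of Fig. 3 concrete, the following is a runnable, unannotated Go sketch of the same pattern. It is not the code of the full paper: the values of numOfWorkers and chunkSize, the fixed-size chunking, and the final close(c) that lets the workers terminate are assumptions added for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	numOfWorkers = 4 // assumed; the excerpt leaves the value open
	chunkSize    = 3 // assumed; the excerpt elides how chunks are computed
)

// worker replaces x with y in every chunk it receives and
// reports each finished chunk to the wait group.
func worker(c <-chan []int, wg *sync.WaitGroup, x, y int) {
	for chunk := range c {
		for i := range chunk {
			if chunk[i] == x {
				chunk[i] = y
			}
		}
		wg.Done()
	}
}

// searchAndReplace splits s into disjoint chunks, hands each chunk to a
// worker over a buffered channel, and waits until all chunks are processed.
func searchAndReplace(s []int, x, y int) {
	var wg sync.WaitGroup
	c := make(chan []int, 4)
	for i := 0; i < numOfWorkers; i++ {
		go worker(c, &wg, x, y)
	}
	for offset := 0; offset < len(s); offset += chunkSize {
		next := offset + chunkSize
		if next > len(s) {
			next = len(s)
		}
		wg.Add(1)
		c <- s[offset:next]
	}
	wg.Wait()
	close(c) // assumption: shut the workers down once all work is done
}

func main() {
	s := []int{1, 2, 1, 3, 1, 1, 4}
	searchAndReplace(s, 1, 9)
	fmt.Println(s)
}
```

The chunks are disjoint sub-slices, so no two workers write to the same element; this is the informal counterpart of the permissions that messagePerm transfers to each worker, and wg.Wait() is the point at which the main thread may safely read the whole slice again.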


3 Encoding 


Gobra encodes an annotated Go program into a Viper program that verifies only
if the input program is correct. Many features of Gobra are also present in
Viper, making parts of the encoding straightforward. For instance, methods, 
pure methods, and predicates are encoded to their Viper counterpart. Viper’s 
permission model (including fractions, wildcards, and quantifiers) is similar to 
Gobra’s, but memory is represented differently; Viper’s heap is object-based, 
where each object contains all declared fields. Viper’s fields store primitive values 
(including references). To encode Go’s compound values such as structs, arrays, 
slices, and interface values, we use Viper’s mechanism to declare mathematical 
types (such as tuples) using uninterpreted types, uninterpreted functions, and 
appropriate axioms. Exclusive Go values are directly represented using these 
mathematical types. For shared values, there is an indirection via the Viper 
heap to permit aliasing and apply permission-based reasoning. 


Interfaces. As explained in Sect. 2.2, our treatment of Go interfaces relies 
on interface predicates, specification methods, and implementation proofs. We 
explain how we handle the former two here; based on this encoding, the encoding 
of implementation proofs is analogous to methods. 

Intuitively, we encode interface predicates as a case split over all possible 
implementations. All implementations not present in the current scope are sub- 
sumed by an abstract default case. Consequently, adding an implementation does 
not invalidate existing proofs, which enables modular reasoning. The predicate 
for the stream example (Fig. 2) is encoded as follows: 


predicate memory(x: ⟦interface{}⟧) {
  ⟦typeOf(x) == *counter⟧ ? ⟦acc(x.(*counter))⟧ : unknownMemory(x)
}

predicate unknownMemory(x: ⟦interface{}⟧)

function hasNext(x: ⟦interface{}⟧) returns (y: ⟦bool⟧)
  req ⟦acc(x.memory(), _)⟧
  ens ⟦typeOf(x) == *counter⟧ ⇒ y == hasNext_proof(⟦x.(*counter)⟧)

The body of the predicate branches on the dynamic type of x, with a single case
for the (only) given implementation. The abstract predicate unknownMemory
encodes the default case. The encoding of pure methods such as hasNext uses an
analogous case split, but refers to hasNext_proof, which is part of the implementation
proof (Fig. 2 line 20) and couples the interface and implementation method. Our
encoding of interface predicates is an instance of an abstract predicate family [18]. 
For Go, we have crafted a variant that is well-suited for implementation proofs, 
pure interface methods, and structural subtyping. 


First-Class Predicates. Our support for concurrency uses first-class predi- 
cates, for instance, to specify channel invariants (see Sect. 2.3). We encode first- 
class predicate values as mathematical types, using defunctionalization. Pred- 
icate instances are represented by abstract predicates that take the predicate 
value as an argument. First-class predicates enable us to use library stubs to 
support concurrency primitives such as mutexes and wait groups. These stubs 
allow us to encode the use of these concurrency primitives via standard method 
calls. Go’s native channel operations are represented analogously. 
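Defunctionalization can be illustrated independently of Viper. In the sketch below, a predicate value is plain data (a constructor tag plus its captured arguments) and a single function interprets it. All names (predKind, predValue, holds) are hypothetical, and ordinary boolean predicates stand in for separation-logic predicates.

```go
package main

import "fmt"

// predKind names each predicate constructor that may be partially applied.
type predKind int

const (
	allEqual  predKind = iota // allEqual{v}(chunk): every element equals v
	noneEqual                 // noneEqual{v}(chunk): no element equals v
)

// predValue is a defunctionalized predicate: a tag plus captured arguments.
// Being plain data, two predicate values can be compared with ==.
type predValue struct {
	kind predKind
	v    int
}

// holds is the single "apply" function that interprets a predicate value
// on the remaining argument.
func holds(p predValue, chunk []int) bool {
	for _, e := range chunk {
		switch p.kind {
		case allEqual:
			if e != p.v {
				return false
			}
		case noneEqual:
			if e == p.v {
				return false
			}
		}
	}
	return true
}

func main() {
	inv := predValue{noneEqual, 7}
	same := inv == predValue{noneEqual, 7}
	fmt.Println(same, holds(inv, []int{1, 2, 3}), holds(inv, []int{1, 7}))
}
```

Representing predicates as data is what makes equalities such as c.RecvGotPerm() == messagePerm{wg, _, x, y} in Fig. 3 meaningful: both sides are values of a mathematical type rather than opaque functions.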


4 Implementation and Evaluation 


The Gobra implementation consists of a parser and type checker for annotated 
Go programs and a translation of those programs into the Viper intermediate 
verification language. The resulting Viper program is verified using Viper’s sym- 
bolic execution backend, which in turn uses the Z3 SMT solver [7]. Verification 
errors are translated back to the Go level, so users never have to inspect the
internal encoding. Error messages
contain the failing assertion and a reason describing why the assertion failed.
Gobra’s test suite contains 407 verification tests (with and without errors) with
a total of 10,030 lines of code (Go code and annotations) that take 14.9 min to verify.

We evaluated Gobra on 14 interesting verification problems, which include 
well-known algorithms and data structures, and cover Go’s main features, such 
as interfaces (Examples 7-9) and concurrency primitives (Examples 13 and 14), 
including goroutines, mutexes, wait groups, and channels. For each example, 
Gobra verifies memory safety and functional correctness properties. To assess 
Gobra’s performance on failing verifications, we have additionally constructed 
two incorrect variations of each example, one with a seeded error in the specifi- 
cation and one in the implementation. 

All experiments were executed on a warmed-up JVM on a MacBook Pro with 
a 2.3 GHz 8-Core Intel Core i9 CPU and 32 GB of RAM, running macOS 11.1 
and OpenJDK 11. For each experiment, we measured its verification time using 
Viper’s symbolic execution backend and averaged the duration of twelve execu- 
tions, excluding the slowest and fastest outlier. 
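The averaging scheme can be sketched as a trimmed mean. The function below is an illustration and not the authors' measurement script; in particular, it assumes that exactly one minimum and one maximum are dropped from the measured runs before averaging.

```go
package main

import (
	"fmt"
	"sort"
)

// trimmedMean averages the samples after dropping the single smallest
// and the single largest value (assumed reading of "excluding the
// slowest and fastest outlier").
func trimmedMean(samples []float64) float64 {
	if len(samples) < 3 {
		panic("need at least three samples")
	}
	sorted := append([]float64(nil), samples...) // do not reorder the input
	sort.Float64s(sorted)
	sum := 0.0
	for _, v := range sorted[1 : len(sorted)-1] {
		sum += v
	}
	return sum / float64(len(sorted)-2)
}

func main() {
	// Made-up timings in seconds, for illustration only.
	runs := []float64{2.0, 3.0, 100.0, 1.0}
	fmt.Println(trimmedMean(runs))
}
```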

Figure 4 summarizes the results, including the required annotations and ver- 
ification times for the three variants of each example. The annotation overhead 
 #  Example                   LOC / Spec.  Viper LOC  T [s]   T_spec error [s]  T_impl error [s]
 1  binary search tree        125 / 140    632        10.88   10.50             11.67
 2  dutchflag                 22 / 16      142        2.02    IES               1.88
 3  heapsort                  47 / 93      271        16.72   19.30             15.23
 4  dense and sparse matrix   69 / 62      326        10.46   10.55             10.06
 5  binary tree               59 / 20      217        2.09    2.08              2.11
 6  running ex. (Fig. 1)      10 / 11      164        1.71    1.70              1.70
 7  running ex. (Fig. 2)      24 / 16      186        1.04    0.98              1.01
 8  list of interfaces        46 / 27      219        1.45    1.41              1.54
 9  visitor pattern           76 / 30      475        4.38    4.22              5.45
10  zune                      gil / 12     141        1.08    1.07              1.06
11  relaxed prefix            25 / 36      158        7.08    5.36              4.19
12  pair insertion sort       50 / 105     353        15.55   12.64             13.96
13  parallel search replace   35 / 94      565        53.18   51.97             61.54
14  parallel sum              31 / 98      527        58.39   50.25             57.69


Fig. 4. Experimental results. For each experiment, we list the number of lines of Go 
code (LOC), number of lines of specification and proof annotations (Spec), and the 
average verification time in seconds for correct examples (T), errors in the specification
(T_spec error), and errors in the implementation (T_impl error). A line containing both code
and annotations is counted as one line of Go code and one line of annotation.


ranges between 0.3 and 3.1 lines of annotations per line of code, which is typ- 
ical for SMT-based deductive verifiers. Verification times range between a sec- 
ond and a minute per example. The verification times are significantly higher 
when the verified code uses concurrency features; these examples require quan- 
titatively more and more-complex specifications, which complicates reasoning. 
Lastly, there is hardly any difference between successful and failed verification 
attempts. Consistent performance is crucial when verifiers are used interactively, 
where users run them frequently, especially on programs that do not yet verify. 


5 Related Work and Conclusion 


Besides Gobra, we are aware of two other verification approaches for Go. Peren- 
nial [4] reasons about concurrent, crash-safe systems. Their core techniques are 
an extension to the Iris framework [13] and independent of Go. They connect 
their theory to Go programs with Goose, a shallow embedding of Go into Coq [5], 
which proves that Go code complies with a given transition system. In contrast 
to Gobra, Perennial does not support core Go features such as channels and 
interfaces. 

Several prior works [9,14,15] infer behavioral types [12] to reason about 
Go’s channel-based message passing. After they infer behavioral types for a 
given program, they check safety and liveness properties on the inferred types, 
using model checkers such as mCRL2 [6]. Some works use additional analyses to 
strengthen the provided guarantees. Lange et al. [15] add a termination analysis 
to enable one to verify unbounded properties under certain conditions. Gabet 
and Yoshida [9] extend this work by inferring behavioral types on shared variables
and locks to additionally reason about data-race freedom, lock safety, and
lock liveness. The approaches by Lange et al. [15] and Gabet and Yoshida [9] are 
vastly different from Gobra. They do not verify code contracts, but instead ver- 
ify global properties such as deadlock and data-race freedom. Their automation 
is high and annotation overhead minimal, but their analyses are not modular 
and do not verify functional properties of code. Furthermore, they do not verify 
properties about the state of the heap. 

There are some prior works that can handle channel-based concurrency and 
heap-manipulating programs, but these do not apply directly to Go. Villard 
et al. [20] introduce a powerful contract mechanism to specify protocols that 
channels must adhere to. Their channel specification language is more expressive 
than the one presented in this paper. Their contracts are finite state machines 
and thus can have multiple phases. However, their channels are always shared 
between two peers whereas Go supports more advanced concurrency patterns 
where both channel endpoints are shared between an unbounded number of peers. 
Actris [10,11] is a concurrent separation logic built on top of the Iris framework 
to reason about session types in an interactive theorem prover. Actris can go 
beyond two peers, but to do so, it requires a memory model that is incompatible 
with Go’s memory model. Actris models the sharing of channel endpoints via 
Iris’ ghost locks, which, to our knowledge, implies sequentialization of sends, and
dually receives, which is not guaranteed by Go’s memory model. 

Gobra’s verification logic and encoding into Viper have been inspired by 
several other Viper-based verifiers, such as Nagini [8] for Python, Prusti [1] for 
Rust, and VerCors [2] for Java. None of these verifiers address the Go-specific 
features that Gobra supports. 


Conclusion. We introduced Gobra, the first modular verifier for Go that sup- 
ports reasoning about a crucial aspect of the language: the combination of 
channel-based concurrency and heap-manipulating constructs. Moreover, Gobra 
is the first verifier to support Go’s version of interfaces and structural subtyping. 
In future work, we will expand the properties that can be verified with Gobra, in 
particular to liveness and hyper-properties. Furthermore, we are applying Gobra 
to verify the implementation of a full-fledged network router [23]. Gobra is hosted 
on GitHub at https://github.com/viperproject/gobra.
Abstract. We consider the broad problem of analyzing safety properties
of asynchronous concurrent programs under arbitrary thread interleav- 
ings. Delay-bounded deterministic scheduling, introduced in prior work, 
is an efficient bug-finding technique to curb the large cost associated 
with full scheduling nondeterminism. In this paper we first present a 
technique to lift the delay bound for the case of finite-domain variable 
programs, thus adding to the efficiency of bug detection the ability to 
prove safety of programs under arbitrary thread interleavings. Second, 
we demonstrate how, combined with predicate abstraction, our technique 
can both refute and verify safety properties of programs with unbounded 
variable domains, even for unbounded thread counts. Previous work has 
established that, for non-trivial concurrency routines, predicate abstrac- 
tion induces a highly complex abstract program semantics. Our tech- 
nique, however, never statically constructs an abstract parametric pro- 
gram; it only requires some abstract-states set to be closed under certain 
actions, thus eliminating the dependence on the existence of verification 
algorithms for abstract programs. We demonstrate the efficiency of our 
technique on many examples used in prior work, and showcase its sim- 
plicity compared to earlier approaches on the unbounded-thread Ticket 
Lock protocol. 


1 Introduction 


Asynchronous concurrent programs consist of a number of threads executing 
in an interleaved fashion and communicating through shared variables, message 
passing, or other means. In such programs, the set of states reachable by one 
thread depends both on the behaviors of the other threads, and on the order 
in which the threads are interleaved to create a global execution. Since the 
thread interleaving is unknown to the program designer, analysis techniques for 
asynchronous programs typically assume the worst case, i.e., that threads can 
interleave arbitrarily; we refer to this assumption as full scheduling nondeter- 
minism. In order to prove safety properties of such programs, we must therefore 
ultimately investigate all possible interleavings. 
Proposed about a decade ago, delay-bounded deterministic scheduling [10] is
an effective technique to curb the large cost associated with exploring arbitrary
thread interleavings. The idea is that permitting a limited number of scheduling
delays (skipping a thread when it is normally scheduled to execute) in an otherwise
deterministic scheduler approximates a fully nondeterministic scheduler
from below. Delaying gives rise to a new thread interleaving, potentially reach- 
ing states unreachable to the deterministic scheduler. In the limit, i.e., with 
unbounded delays, the delaying and the fully nondeterministic scheduler permit 
the same set of executions and, thus, reach the same states. 

Prior work has demonstrated that delay-bounded scheduling can “discover 
concurrency bugs efficiently” [10], in the sense that such errors are often detected 
for a small number of permitted delays. The key is that permitting few delays means
exploring only few interleavings. Thus, under moderate delay bounds, the reachable
state space can often be explored exhaustively, resulting (if no errors are found)
in a delay-bounded verification result.

We build on the empirical insight of efficient delay-bounded bug detection 
(testing) or verification, and make the following contributions. 

1. Delay-bounded scheduling without delay. If no bug is found while 
exhaustively exploring the given program for a given delay budget, we “feel 
good” but are left with an uncertainty as to whether the program is indeed bug- 
free. We present a technique to remove this uncertainty, as follows. We prove 
that the set R(d) of states reached under a delay bound d equals the set R of 
reachable states under arbitrary thread interleavings if two conditions are met: 


— increasing the delay bound by a number roughly equal to the number of 
executing threads produces no additional reachable states, and 
— set R(d) is closed under a certain set of critical program actions. 


In some cases, the set of “critical program actions” may be definable statically at 
the language level; in others, this must be determined per individual action. To 
increase the chance that the above two conditions eventually hold, we typically 
work with conservative abstractions of R; the (precisely computed) abstract 
reachability set R is then used to decide whether the program is safe. 

2. Efficient delay-unbounded analysis. We translate the above founda- 
tional result into an efficient delay-unbounded analysis algorithm. It starts with 
a deterministic Round-Robin scheduler, parameterized by the number of rounds 
r it runs and of delays d it permits, and increases r and d in a delicate schedule 
weak-until the two conditions above hold (it is not guaranteed that they ever 
will). The key for efficiency is that the reachability sets under increasing r and d 
are monotone. We therefore can determine reachability under parameters r’ > r 
and d’ > d starting from a frontier of the states reached under bounds r and d. 
We present this algorithm and prove it correct. We also prove its termination 
(either finding a bug or proving correctness), under certain conditions. 

3. Delay-unbounded analysis for general infinite-state systems. We 
demonstrate the power of our technique on programs with unbounded-domain 
variables and unbounded thread counts. The existence of integer-like variables 
suggests the use of a form of predicate abstraction. Prior work has shown that
predicate abstraction for unbounded-thread concurrent programs leads to com- 
plex abstract program semantics [8,15], going beyond even the rich class of 
well-quasiordered systems [1]. Our delay-unbounded analysis technique does not 
require an abstract program. Instead, we add to the idea of reachability analysis 
under increasing r and d a third dimension n, representing increasing thread 
counts, enjoying a similar convergence property. Circumventing the static con- 
struction of the abstract program simplifies the verification process dramatically. 

In summary, this paper presents a technique to lift the bound used in delay- 
bounded scheduling, while (empirically) avoiding the combinatorial explosion of 
arbitrary thread interleavings. Our technique can therefore find bugs as well as 
prove programs bug-free. We demonstrate its efficiency using concurrent push- 
down system benchmarks, as well as known-to-be-hard infinite-state protocols 
such as the Ticket Lock [3]. We offer a detailed analysis of internal performance 
aspects of our algorithm, as well as a comparison with several alternative tech- 
niques. We attribute the superiority of our method to the retained parsimony of 
limited-delay deterministic-schedule exploration. 

A full version of this paper, with proofs omitted here and other supplementary 
information, can be found in an accompanying Technical Report [14]. 


2 Delay-Bounded Scheduling 


2.1 Basic Computational Model 


For the purposes of introducing the idea behind delay-bounded scheduling, we 
define a deliberately broad asynchronous program model. Consider a multi- 
threaded program P consisting of n threads. We fix this number throughout 
the paper up to and including Sect. 5.1, after which we consider parameterized 
scenarios. Each thread runs its own procedure and communicates with others 
via shared program variables. A “procedure” is a collection of actions (such as 
those defined by program statements). We define a shared-states set G and, for
each thread, a local-states set $L_i$ ($0 \le i < n$). A global program state is therefore
an element of $G \times \prod_{i=0}^{n-1} L_i$. In addition, a finite number of states are designated
as initial. (Finiteness is required in Sect. 4 for a termination argument [Lemma
11].)

The execution model we assume in this paper is asynchronous. A step is a
pair (s,s’) of states such that there exists a thread i ($0 \le i < n$) such that s
and s’ agree on the local states of all threads $j \ne i$; the local state of thread i
may have changed, as well as the shared state. We say thread i executes during
the step, by executing some action of its procedure.¹ The execution semantics
within the procedure is left to the thread (e.g., there may be multiple enabled 
actions in a state, an action may itself be nondeterministic, etc.). Without loss of 
generality for safety properties, we assume that the transition relation induced 


1 If only the shared state changes, it is possible that the identity of the executing 
thread is not unique. This small ambiguity is inconsequential for this paper. 
by each thread’s possible actions is total. That is, instead of an action x being
disabled for a thread in state s, we stipulate that firing x from s results in s. 

A path is a sequence $p = (s_0, \ldots, s_l)$ of states such that, for $0 \le i < l$,
$(s_i, s_{i+1})$ is a step. This path has length l (= number of steps taken). A state s
is reachable if there exists a path from some initial state to s. We denote by R 
the (possibly infinite) set of states reachable in P. Note that these definitions 
permit arbitrary asynchronous thread interleavings. 


2.2 Free and Round-Robin Scheduling 


We formalize the notion of a scheduling policy indirectly, by parameterizing the 
concept of reachability by the chosen scheduler. A state s is reachable under 
free scheduling if there exists a path $p = (s_0, \ldots, s_l)$ from some initial state
$s_0$ to $s_l = s$. A free scheduler is simulated in state space explorers using full
nondeterminism. State s is reachable under n-thread Round-Robin scheduling
with round bound r if there exists a path $p = (s_0, \ldots, s_l)$ from some initial state
$s_0$ to $s_l = s$ such that


1. $\lceil l/n \rceil \le r$, and
2. for $0 \le i < l$, thread $i \bmod n$ executes during step $(s_i, s_{i+1})$.


2.3 Delay-Bounded Round-Robin Scheduling 


We approximate the set of states reachable under free scheduling from below, 
using a relaxed Round-Robin scheduler. The scheduler introduced so far is, how- 
ever, deterministic and thus vastly underapproximates the free scheduler, even 
for unbounded r. The solution proposed in earlier work is to introduce a limited 
number d of scheduling delays [10]. A delayed thread is skipped in the current 
round and must wait until the next round. 


Definition 1. State s is reachable under Round-Robin scheduling with
round bound r and delay bound d (“reachable under RR(r,d) scheduling”
for short) if there exists a path $p = (s_0, \ldots, s_l)$ from some initial state $s_0$ to
$s_l = s$ and a function $f : \{0, \ldots, l-1\} \to \{0, \ldots, n-1\}$, called scheduling
function, such that

1. for $d_p := f(0) + \sum_{i=1}^{l-1} ((f(i) - f(i-1) - 1) \bmod n)$, we have $d_p \le d$,

2. $\lceil (l + d_p)/n \rceil \le r$ ($d_p$ as defined in 1.), and

3. for $0 \le i < l$, thread $f(i)$ executes during step $(s_i, s_{i+1})$.

Variable $d_p$ from 1. quantifies the total delay, compared to a perfect Round-Robin
scheduler, that the scheduling along path p has accumulated. Consider the
case of n = 4 threads T0,...,T3. Then the scheduling sequence (f(0),..., f(11))
below on the left, of l = 12 steps and involving 13 states, follows a perfect
Round-Robin schedule of r = 3 rounds (separated by |):


0 1 2 3 | 0 1 2 3 | 0 1 2 3          0 1 X 3 | 0 1 2 X | X 1 2 3
The sequence on the right of l = 9 steps follows a Round-Robin scheduling
of r = 3 rounds and a total of $d_p = 3$ delays: one after the second step (T2 is
delayed: (3 - 1 - 1) mod 4 = 1), another two delays after the sixth step (T3 and
T0 are delayed: (1 - 2 - 1) mod 4 = 2). The final state of this path is reachable
under RR(3,3) scheduling. Note that delays effectively shorten rounds.
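The two example schedules can be checked mechanically. The sketch below assumes the reading of Definition 1 in which the total delay is f(0) plus the sum of (f(i) - f(i-1) - 1) mod n over all later steps, and the number of rounds is the ceiling of (l + d_p)/n. The function names are hypothetical.

```go
package main

import "fmt"

// mod returns the non-negative remainder of a modulo n
// (Go's % operator may return negative values).
func mod(a, n int) int { return ((a % n) + n) % n }

// totalDelay computes d_p for a scheduling sequence f over n threads:
// the number of threads skipped relative to a perfect Round-Robin order.
func totalDelay(f []int, n int) int {
	if len(f) == 0 {
		return 0
	}
	d := f[0]
	for i := 1; i < len(f); i++ {
		d += mod(f[i]-f[i-1]-1, n)
	}
	return d
}

// rounds returns ceil((l + d_p) / n), the number of rounds the schedule needs.
func rounds(f []int, n int) int {
	return (len(f) + totalDelay(f, n) + n - 1) / n
}

func main() {
	perfect := []int{0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3} // left sequence
	delayed := []int{0, 1, 3, 0, 1, 2, 1, 2, 3}          // right sequence
	fmt.Println(totalDelay(perfect, 4), rounds(perfect, 4)) // 0 3
	fmt.Println(totalDelay(delayed, 4), rounds(delayed, 4)) // 3 3
}
```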

We denote by R(r, d) the set of states reachable in P under RR(r, d) schedul- 
ing. (Note that this set is finite, for any program P.) It is easy to see that, given 
sufficiently large r and d, any schedule can be realized under RR(r, d) scheduling: 


Theorem 2. State s is reachable under free scheduling iff there exist r,d such
that s is reachable under RR(r,d) scheduling: $R = \bigcup_{r,d \in \mathbb{N}} R(r,d)$.


State-space exploration under free scheduling can therefore be reduced to enu- 
merating the two-dimensional parameter space (r, d) and computing states reach- 
able under RR(r,d) scheduling. This can be used to turn a Round Robin-based 
state explorer into a semi-algorithm, dubbed delay-bounded tester in [10]. 
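A minimal sketch of such an explorer is given below. It uses a slot-based reading of RR(r,d) scheduling, which is an assumption of this illustration: a run has at most r·n scheduling slots, slot k belongs to thread k mod n, and each slot is either used for a step or skipped at the cost of one delay. The toy two-thread program (function step) is hypothetical.

```go
package main

import "fmt"

// step is the total, deterministic transition function of a toy program
// with one shared variable g: thread 0 sets g from 0 to 1, thread 1 sets
// g from 0 to 2. A disabled action leaves the state unchanged.
func step(thread, g int) int {
	if g == 0 {
		return thread + 1
	}
	return g
}

// config is an exploration node: program state, next slot, delays used.
type config struct{ g, slot, delays int }

// reach returns the program states reachable under RR(r, d) scheduling
// of n threads, starting from g = 0.
func reach(n, r, d int) map[int]bool {
	states := map[int]bool{}
	seen := map[config]bool{}
	work := []config{{0, 0, 0}}
	for len(work) > 0 {
		c := work[len(work)-1]
		work = work[:len(work)-1]
		if seen[c] {
			continue
		}
		seen[c] = true
		states[c.g] = true
		if c.slot == r*n {
			continue // round budget exhausted
		}
		thread := c.slot % n
		// Either the scheduled thread executes ...
		work = append(work, config{step(thread, c.g), c.slot + 1, c.delays})
		// ... or it is delayed, if the delay budget permits.
		if c.delays < d {
			work = append(work, config{c.g, c.slot + 1, c.delays + 1})
		}
	}
	return states
}

func main() {
	for _, b := range [][2]int{{1, 0}, {3, 0}, {1, 1}} {
		fmt.Println("r,d =", b, "reachable states:", len(reach(2, b[0], b[1])))
	}
}
```

In this toy program, state g = 2 is unreachable without delays for any round bound, because thread 0 always moves first; a single delay suffices to reach it.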

An important property of the round and delay bounds is that increasing 
them can only increase the reachability sets: 


Property 3 (Monotonicity in r & d). For any round and delay bounds r 
and d: 
$R(r,d) \subseteq R(r+1,d)$ , $R(r,d) \subseteq R(r,d+1)$ . (1)


This follows from the $\ldots \le r$ and $\ldots \le d$ constraints in Definition 1. The property
relies on r and d being external to the program, not accessible inside it.
Under this provision, monotonicity in any kind of resource bound is a fairly nat- 
ural yet not always guaranteed property; we give a counterexample in Sect. 5.2. 


3 Abstract Closure for Delay-Bounded Analysis 


The goal of this paper is a technique to prove safety properties of asynchronous 
programs under arbitrary thread schedules. Theorem 2 affords us the possibility 
to reduce the exploration of such arbitrary schedules to certain bounded Round- 
Robin schedules, but we still need to deal with those bounds. In this section we 
present a closure property for bounded Round-Robin explorations. 


3.1 Respectful Actions 


Let S be the set of global program states of P, and let $a : S \to A$ be an abstraction
function, i.e., a function that maps program states to elements of some
abstract domain A. Function a typically hides certain parts of the information 
contained in a state, but the exact definition is immaterial for this subsection. 

A key ingredient of the technique proposed in this paper is to identify actions 
of the program executed by a thread with the property that the abstract succes- 
sor of an abstract state under such an action does not depend on concrete-state 
information hidden by the abstraction. 




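Definition 4. Let x be a program action, and let the relation $s \xrightarrow{x}_i s'$ denote
that $s \to s'$ is a step during which thread i executes x. Action x respects a if,
for all states $s_1, s_2, s_1', s_2' \in S$ and all i with $0 \le i < n$:


$a(s_1) = a(s_2) \;\wedge\; s_1 \xrightarrow{x}_i s_1' \;\wedge\; s_2 \xrightarrow{x}_i s_2' \;\Rightarrow\; a(s_1') = a(s_2')$ . (2)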


Intuitively, “x respects a” means that successors under action x of a-equivalent
states all have the same unique abstraction. Note the special case $s_1 = s_2$,
$s_1' \neq s_2'$: for nondeterministic actions x to respect a, multiple successors $s_1', s_2'$ of
the same concrete state $s_1 = s_2$ under x also must have the same abstraction.


Example 5. Consider n-thread concurrent pushdown systems (CPDS), an 
instance of the asynchronous computational model presented in Sect. 2.1. We 
have a finite set of shared states readable and writeable by each thread. Each 
thread also has a finite-alphabet stack, which it can operate on by (i) overwrit- 
ing the top-of-the-stack element, (ii) pushing an element onto the stack, or (iii) 
popping an element off the top of the non-empty stack. The classic pointwise 
top-of-the-stack abstraction function is defined by 


$a(g, w_0, \ldots, w_{n-1}) = (g, \sigma_0, \ldots, \sigma_{n-1})$ , (3)


where g is the shared state (unchanged by a), $w_i$ is the contents of the stack of
thread i, and $\sigma_i$ is the top of $w_i$ if $w_i$ is non-empty, and empty otherwise [18].
Note that the domain into which a maps is a finite set. 

Push and overwrite actions respect a, while pop actions disrespect it: consider
the case n = 1 and $s_1 = (g, w_0) = (0, 10)$ and $s_2 = (0, 11)$, with stack contents 10
and 11, resp. (left = top). While $a(s_1) = a(s_2) = (0, 1)$, the (unique) successor
states of $s_1$ and $s_2$ after a pop are not a-equivalent: the elements 0 and 1 emerge
as the new top-of-the-stack symbols, respectively, which a can distinguish.
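The pop counterexample can be replayed mechanically. The following is a minimal Python sketch, not the authors' implementation: it encodes a single-thread pushdown state as a pair (g, stacks) with stacks written as strings (leftmost character = top), and checks Eq. (2) over a small hand-picked sample of states. The names `alpha`, `respects_on` and `STATES` are hypothetical, and a finite sample can only refute respectfulness, never prove it.

```python
def alpha(state):
    """Top-of-stack abstraction of Eq. (3): keep g and each stack's top symbol."""
    g, stacks = state
    return (g, tuple(w[0] if w else None for w in stacks))


def _update(state, i, new_stack):
    """Return a copy of state in which thread i's stack is new_stack."""
    g, stacks = state
    return (g, stacks[:i] + (new_stack,) + stacks[i + 1:])


def push(sym):
    # Push is always enabled.
    return lambda state, i: [_update(state, i, sym + state[1][i])]


def overwrite(sym):
    # Overwrite is only enabled on a non-empty stack.
    return lambda state, i: (
        [_update(state, i, sym + state[1][i][1:])] if state[1][i] else [])


def pop(state, i):
    # Pop is only enabled on a non-empty stack.
    return [_update(state, i, state[1][i][1:])] if state[1][i] else []


def respects_on(action, states, n):
    """False if two sampled states with equal abstraction have successors
    (under the same thread i) whose abstractions differ, as in Eq. (2)."""
    for i in range(n):
        succ_abs = {}  # abstraction of source -> abstractions of its successors
        for s in states:
            for t in action(s, i):
                succ_abs.setdefault(alpha(s), set()).add(alpha(t))
        if any(len(v) > 1 for v in succ_abs.values()):
            return False
    return True


# Sample: shared state 0, one thread, all stacks over {0, 1} of length <= 2.
STATES = [(0, (w,)) for w in ["", "0", "1", "00", "01", "10", "11"]]

if __name__ == "__main__":
    print("push respects a on sample:     ", respects_on(push("1"), STATES, 1))
    print("overwrite respects a on sample:", respects_on(overwrite("0"), STATES, 1))
    print("pop respects a on sample:      ", respects_on(pop, STATES, 1))
```

On this sample, the stacks 10 and 11 share the abstraction (0, 1) but pop to different tops, which is exactly the witness described in Example 5.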


The notion of respectful actions gives rise to a condition on sets of abstract 
states that we will later use for convergence proofs: 


Definition 6. An abstract-state set A is closed under actions disrespecting
a if, for every $a \in A$ and every successor $a'$ of a under a disrespectful
action, $a' \in A$.


For maximum precision: $a'$ is said to be a successor of a under a disrespectful
action if there exist concrete states s and s', a thread id i and an action x such
that $a(s) = a$, $a(s') = a'$, x disrespects a, and $s \xrightarrow{x}_i s'$. If abstraction a is clear
from the context, we may just say “closed under disrespectful actions”.


3.2 From Delay-Bounded to Delay-Unbounded Analysis 


We now present our idea to turn a round- and delay-bounded tester into a 
(partial) verifier, namely by exploring the given asynchronous program for a 
number of round and delay bounds until we have “seen enough”. Recall the 
notations R and R(r,d) defined in Sect. 2. We also write $\overline{R}$ and $\overline{R}(r,d)$ as
shorthand for a(R) and a(R(r,d)), i.e. the respective abstract reachability sets.
(Note that $\overline{R}$ is not an abstract fixed point; instead, it is the result of applying
a to the concrete reachability set R; see discussion in Sect. 7.)


Theorem 7. For any $r, d \in \mathbb{N}$, if $\overline{R}(r,d) = \overline{R}(r+1, d+n-1)$ and $\overline{R}(r,d)$ is
closed under actions disrespecting a, then $\overline{R}(r,d) = \overline{R}$.


The theorem states: if the set of abstract states reachable under RR(r, d) scheduling
does not change after increasing the round bound by 1 and the delay bound
by n − 1, and it is closed under disrespectful actions, then $\overline{R}(r,d)$ is in fact the
exact set $\overline{R}$ of abstract states reachable under a free scheduler: no approximation,
no rounds, no delays, no Round-Robin.


Proof of Theorem 7. We have to show that $\overline{R}(r,d)$ is closed under the abstract
image function Im induced by a, defined as


$Im(a) = \{a' : \exists s, s' : a(s) = a,\ a(s') = a',\ s \to s'\}$ .


That is, we wish to show $Im(\overline{R}(r,d)) \subseteq \overline{R}(r,d)$, which proves that no more
abstract states are reachable. Consider $a \in \overline{R}(r,d)$ and $a' \in Im(a)$, i.e. we have
states s, s' such that $a(s) = a$, $a(s') = a'$, and $s \xrightarrow{x}_i s'$ for some thread i and
some action x. The goal is to show that $a' \in \overline{R}(r,d)$.

To this end, we distinguish flavors of x. If x disrespects a, then $a' \in \overline{R}(r,d)$,
since the set is closed under disrespectful actions.

So x respects a. Since $a \in \overline{R}(r,d)$, there exists a state $s_0 \in R(r,d)$ with
$a(s_0) = a$. Suppose for a moment that thread i is scheduled to run in state $s_0$.
Then it can execute action x; any successor state $s_0'$ satisfies $s_0' \in R(r,d)$, and:


$a' \overset{(\text{def } a')}{=} a(s') \overset{(x \text{ resp. } a)}{=} a(s_0') \in a(R(r,d)) = \overline{R}(r,d)$ .

But what if the thread scheduled to run in state $s_0$ under RR(r, d) scheduling,
call it j, is not thread i? Then we delay any threads that are scheduled before
thread i's next turn; if i < j, this “wraps around”, and we need to advance to the
next round. The program state has not changed; we are still in $s_0$. Let $s_0'$ be the
successor state obtained when thread i now executes action x, and $\Delta(i,j) = 1$ if
i < j, 0 otherwise. Then we have $s_0' \in R(r + \Delta(i,j),\ d + (j - i) \bmod n)$, and:


$a' \overset{(\text{def } a')}{=} a(s') \overset{(x \text{ resp. } a)}{=} a(s_0') \in \overline{R}(r + \Delta(i,j),\ d + (j - i) \bmod n)$
$\overset{(\text{monot. } r,d)}{\subseteq} \overline{R}(r+1,\ d+n-1) = \overline{R}(r,d)$ .


This concludes the proof of Theorem 7. 


Example 8. Consider a simple 3-thread system with a shared-states set G =
{0,1,2}. The local state of each thread is immaterial; function a just returns the
shared state: $a(g, l_0, l_1, l_2) = g$. The threads' procedures consist of the following
actions, which update only the shared state:

Thread T0: 0 → 1      Thread T1: 0 → 1      Thread T2: 0 → 2


Table 1 shows the set of reachable states for different round and delay bounds. 
For example, with one round and zero delays, the only feasible action is T0’s. 
The reachable states are 0 (initial) and 1 (found by T0). The table shows a path 
to a pair (r,d) that meets the conditions of Theorem 7. From (r,d) = (1,0) we 
increment r to find a plateau in r of length 1. We then increase d to try to find 
a plateau in d of length n − 1 = 2. This example shows that a delay plateau of
length 1 is not enough, as 2 is only reachable with at least 2 delays. At (2,2) we
find a new state (2), so we restart the search for plateaus in r and d. At (3,4), the
plateau conditions for Theorem 7 are met. There are no disrespectful transitions,
so by Theorem 7, we know that $\overline{R}(3,4) = \overline{R}$.


Table 1. Reachable states in Example 8 under various round and delay bounds. The
boxed set passes the convergence test suggested by Theorem 7.


        d = 0      d = 1      d = 2        d = 3        d = 4
r = 1   {0,1}      {0,1}      {0,1,2}      {0,1,2}      {0,1,2}
r = 2   {0,1}  →   {0,1}  →   {0,1,2}      {0,1,2}      {0,1,2}
r = 3   {0,1}      {0,1}      {0,1,2}  →   {0,1,2}  →   {0,1,2}
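The entries of Table 1 can be recomputed by brute force. The sketch below is an illustration under stated assumptions rather than the paper's algorithm: states carry (shared state, finder, rounds taken, delays taken) as in Sect. 4, a thread whose action is not enabled simply has no successor, and a delay is only allowed while the delayed thread's turn still lies within the round bound. The names `reachable` and `theorem7_holds` are hypothetical. Since the abstraction of Example 8 returns the shared state, projecting onto the first component yields the abstract set.

```python
from collections import deque

N = 3  # threads T0, T1, T2 of Example 8
# ACTIONS[i]: (pre, post) updates of the shared state available to thread i.
ACTIONS = {0: [(0, 1)], 1: [(0, 1)], 2: [(0, 2)]}


def reachable(r, d):
    """Shared states reachable under RR(r, d), by exhaustive search."""
    init = (0, N - 1, 0, 0)  # (shared state, finder, rounds_taken, delays_taken)
    seen, todo = {init}, deque([init])
    while todo:
        g, finder, rounds, delays = todo.popleft()
        nxt = (finder + 1) % N                      # thread whose turn it is
        nxt_rounds = rounds + (1 if nxt == 0 else 0)
        if nxt_rounds > r:
            continue                                # turn lies beyond round bound
        succs = [(post, nxt, nxt_rounds, delays)
                 for pre, post in ACTIONS[nxt] if pre == g]
        if delays < d:
            succs.append((g, nxt, nxt_rounds, delays + 1))  # delay thread nxt
        for s in succs:
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return {s[0] for s in seen}


def theorem7_holds(r, d, closed=True):
    """Premises of Theorem 7 at (r, d); `closed` stands for closure under
    disrespectful actions (trivially true here: there are none)."""
    return closed and reachable(r, d) == reachable(r + 1, d + N - 1)


if __name__ == "__main__":
    for r in (1, 2, 3):
        print(f"r = {r}:", [sorted(reachable(r, d)) for d in range(5)])
    print("Theorem 7 premises at (1,0):", theorem7_holds(1, 0))
    print("Theorem 7 premises at (2,2):", theorem7_holds(2, 2))
```

Under these assumptions the sets agree with Table 1: state 2 appears only once two delays are available, and the comparison of (2,2) with (3,4) succeeds.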


4 Efficient Delay-Unbounded Analysis 


Turning Theorem 7 into a reachability algorithm requires efficient computation of 
the sets R(r, d). This section presents an approach to achieve this, by expanding 
only frontier states when either the round or the delay parameter is increased. 

To this end, let C be a state property (such as an assertion) that respects a, 
in the sense that, for any states $s_1, s_2$, if $a(s_1) = a(s_2)$, then $s_1 \models C$ iff $s_2 \models C$.
From now on, we further assume the domain A of abstraction function a to be 
finite, which will ensure termination of our algorithm (see Lemma 11 later). 

Our verification scheme for C is shown in Algorithm 1, which uses Algorithm 
2 as a subroutine. In the rest of this paper, we also refer to Algorithm 1 as Delay- 
(and round-) UnBounded Analysis, DrUBA for short. 

The main data structure used in the algorithms is that of a State, which
stores both program variables and scheduling information, in the attributes
finder, rounds_taken, and delays_taken. For a state s, variables s.rounds_taken
and s.delays_taken represent the number of times the scheduler started a round
and delayed a thread, resp., to get to s. Variable s.finder contains the index
of the thread whose action produced s. This is enough information to continue
the execution from s later, starting with the thread after finder. For the initial
states, rounds_taken and delays_taken are zero, and finder is n − 1 (the latter so
that expanding the initial states starts with thread ((n − 1) + 1) mod n = 0). For
set membership testing, two states are considered equal when they agree on their
finders and on program variables. The rounds_taken and delays_taken variables
are for scheduling purposes only and ignored when checking for equality.


Algorithm 1. Verifying property C against all reachable states of program P

Input: n-thread asynchronous program, property C
Output: “safe”, “violation of C”, or “unknown”

 1: Reached := (finite) set of initial states        ▷ Reached: states reached so far
 2: r := 0; d := 0
 3: repeat
 4:     Frontier := {s ∈ Reached : s.rounds_taken = r}
 5:     r++
 6:     for s ∈ Frontier do
 7:         Reached := Reached ∪ FinishRounds(s, r + 1, C)
 8:     r++
 9: until round plateau of length 1
10: repeat
11:     Frontier := {s ∈ Reached : s.delays_taken = d}
12:     d++
13:     for s ∈ Frontier do
14:         s' := s                                   ▷ copy of state s
15:         s'.delays_taken++
16:         s'.finder := (s'.finder + 1) mod n
17:         if s'.finder mod n = 0 then
18:             s'.rounds_taken++
19:         Reached := Reached ∪ FinishRounds(s', r, C)
20:     if new abstract state found during for loop in Line 13 then
21:         goto 3                                    ▷ abort second repeat loop; go back to first
22: until delay plateau of length n − 1
23: if a(Reached) is closed under disrespectful actions then
24:     return “safe”
25: else
26:     return “unknown”


Algorithm 2. FinishRounds(s, r, C)

Input: s: state, r: round bound, C: state property
Output: states reachable from s up to round bound r, without delaying

1: Set<State> Unexplored := {s}, Reached := {}
2: while Unexplored != {} do
3:     select and remove some state u from Unexplored
4:     if u violates C then
5:         throw “violation of C (witnessed by reaching state u)”
6:     Reached := Reached ∪ {u}
7:     if u.finder < n − 1 or u.rounds_taken < r then  ▷ if u schedulable
8:         Unexplored := Unexplored ∪ (Image(u) \ Reached)
9: return Reached
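The equality convention for states can be made concrete as follows. This is a minimal Python sketch under our own naming (the authors' tool is written in Java and may be organized differently): `State`, `initial` and `delay` are hypothetical names, and `variables` stands in for whatever program variables the state stores. The scheduling counters are excluded from comparison and hashing, so set membership depends only on the program variables and the finder.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class State:
    variables: tuple                 # program variables (assumed hashable)
    finder: int                      # index of the thread that produced this state
    # Scheduling bookkeeping: ignored by == and hash().
    rounds_taken: int = field(default=0, compare=False)
    delays_taken: int = field(default=0, compare=False)


def initial(variables, n):
    """Initial state: finder is n - 1 so that thread 0 is expanded first."""
    return State(variables, finder=n - 1)


def delay(s, n):
    """Delay the thread scheduled next from s (cf. Lines 14-18 of Algorithm 1)."""
    finder = (s.finder + 1) % n
    return replace(
        s,
        finder=finder,
        rounds_taken=s.rounds_taken + (1 if finder == 0 else 0),
        delays_taken=s.delays_taken + 1,
    )


if __name__ == "__main__":
    s0 = initial((0,), n=3)
    s1 = delay(s0, n=3)
    print(s1)
    # Same variables and finder, different counters: one set element.
    print(len({s1, State((0,), 0, rounds_taken=9, delays_taken=9)}))
```

Excluding the counters from equality is what lets a state found earlier (with smaller counters) suppress a later rediscovery of the same state.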

As mentioned in Prop. 3, the sequence of reachability sets is monotone with 
respect to both rounds and delays, for any program. This entails two useful 
properties for Algorithm 1. First, we can increase the bounds in any order and 
at individual rates. Second, it suffices to expand states at the frontier of the 
exploration, without missing new schedules. When adding a new delay, we only 
need to delay those states that were (first) found in schedules using the maximum 
delays. When adding a round, we only need to expand states that were (first) 
found in the last round of a schedule. 

Algorithm 1 first advances the round parameter r until a round plateau has 
been reached (Lines 3-9). It does so by running the FinishRounds function on 
frontier states s: those that were reached in the final round r of the previous 
round iteration. FinishRounds (Algorithm 2) explores from the given state s, 
Round-Robin style, up to the given round, without delaying any thread. The 
actual expansion of a state happens in function Image (Line 8 of Algorithm 2), 
which computes a state’s successors and initializes their scheduling variables: 
rounds_taken and delays_taken are copied from u; the finder of the successor is
the next thread (+1 mod n). If this wraps around, rounds_taken is incremented 
as well. 
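A compact rendering of FinishRounds and Image may help here. The sketch below is illustrative only: it hard-codes the three-thread program of Example 8 as the transition system, and it assumes that a thread whose action is not enabled has no successor. `State`, `image` and `finish_rounds` are hypothetical Python names mirroring Algorithm 2.

```python
from dataclasses import dataclass, field

N = 3  # number of threads
# Example 8: ACTIONS[i] lists (pre, post) updates of the shared state g.
ACTIONS = {0: [(0, 1)], 1: [(0, 1)], 2: [(0, 2)]}


@dataclass(frozen=True)
class State:
    g: int
    finder: int
    rounds_taken: int = field(default=0, compare=False)
    delays_taken: int = field(default=0, compare=False)


def image(u):
    """Successors of u when the thread after u.finder executes one action."""
    nxt = (u.finder + 1) % N
    rounds = u.rounds_taken + (1 if nxt == 0 else 0)  # wrap-around starts a round
    return {State(post, nxt, rounds, u.delays_taken)
            for pre, post in ACTIONS[nxt] if pre == u.g}


def finish_rounds(s, r, violates=lambda u: False):
    """States reachable from s up to round bound r, without delaying."""
    unexplored, reached = {s}, set()
    while unexplored:
        u = unexplored.pop()
        if violates(u):
            raise AssertionError(f"violation of C (witnessed by reaching state {u})")
        reached.add(u)
        if u.finder < N - 1 or u.rounds_taken < r:    # u is schedulable
            unexplored |= image(u) - reached
    return reached


if __name__ == "__main__":
    init = State(0, N - 1)
    print(sorted(u.g for u in finish_rounds(init, 1)))
    # After delaying T0 and T1, thread T2 runs from shared state 0.
    twice_delayed = State(0, 1, rounds_taken=1, delays_taken=2)
    print(sorted(u.g for u in finish_rounds(twice_delayed, 1)))
```

From the initial state only T0 fires within one round, so shared states 0 and 1 are found; from the twice-delayed state T2 fires and shared state 2 appears.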

Back to the main Algorithm 1: we have reached a round plateau of length 1 
if the entire for loop in Line 6 sees no new abstract states (no new elements in 
a(Reached)). If so, we are not ready yet to perform the convergence test (recall 
Example 8). Instead, Algorithm 1 now similarly advances the delay parameter 
d (Lines 10-22). For each frontier state (delays_taken = d), we delay the thread 
scheduled to execute from this state (by incrementing (mod n) the finder vari- 
able), and record the taken delay (Line 15). Then we again call the FinishRounds 
function and merge in the states found. Importantly, these merges preserve states 
already in Reached, meaning that the algorithm will keep states found earlier in 
the exploration (with smaller r, d). 

The loop beginning in Line 10 repeats until a delay plateau of length n — 
1 is encountered (as required by Theorem 7). This means that during n — 1 
consecutive repeat iterations, the for loop in Line 13 did not find any new abstract
states. When the round and delay plateaus have the required lengths (1 and 
n—1, resp.), we invoke the convergence test (Line 23), which amounts to applying 
Theorem 7. If the test fails, Algorithm 1 returns “unknown”. 
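The interplay of the two repeat loops, including the restart via goto, can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: it queries an oracle for the abstract reachability set at each (r, d) instead of expanding frontier states, so its counter updates differ in detail from Lines 5 to 8. The oracle `table1` reproduces the entries of Table 1; its value for r = 0 (only the initial state) is our assumption, and all function names are hypothetical.

```python
def table1(r, d):
    """Abstract reachability sets of Example 8, read off Table 1."""
    if r == 0:
        return {0}  # assumption: with zero rounds only the initial state
    return {0, 1} if d < 2 else {0, 1, 2}


def plateau_search(abs_reach, n, closed):
    """Search for a round plateau of length 1 followed by a delay plateau of
    length n - 1, then run the convergence test `closed`."""
    r = d = 0
    current = abs_reach(r, d)
    while True:
        # Round phase: advance r until one increment adds no abstract state.
        while True:
            r += 1
            previous, current = current, abs_reach(r, d)
            if current == previous:
                break
        # Delay phase: need n - 1 consecutive increments of d without change.
        plateau = 0
        while plateau < n - 1:
            d += 1
            previous, current = current, abs_reach(r, d)
            if current != previous:
                break            # new abstract state: back to the round phase
            plateau += 1
        else:
            return ("safe" if closed(current) else "unknown"), r, d


if __name__ == "__main__":
    # Example 8 has no disrespectful actions, so the closure test is trivial.
    print(plateau_search(table1, n=3, closed=lambda abstract_states: True))
```

On the Table 1 oracle the search stops at (r, d) = (3, 4), matching the walk through Example 8.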

Towards proving partial correctness of Algorithm 1, we first show that the 
states eventually collected in set Reached by the algorithm correspond exactly 
to the round- and delay-bounded reachability sets R(r, d), and that, after the
two main repeat loops, a plateau of sufficient length has been generated. As
a corollary, the algorithm is partially correct, i.e. it returns correct answers if it
terminates. 


Lemma 9. If Algorithm 1 reaches Line 23, the current values of r and d satisfy:
(i) Reached = R(r,d), and (ii) $\overline{R}(r-1,\ d-(n-1)) = \overline{R}(r,d)$.


Corollary 10. The answers “safe” and “violation of C” returned by Algorithm 
1 are correct. 


The algorithm won’t return either “safe” or “violation of C” in one of two 
situations: when the convergence test fails in Line 23 (it gives up), and when it 
fails to ever reach this line. The latter can be prevented using a finite-domain a: 


Lemma 11. If the domain A of abstraction function a is finite, Algorithm 1 
terminates on every input. 


Since abstraction a approximates the information contained in a state, a plateau 
may be intermediate, e.g. $\overline{R}(1,0) \subsetneq \overline{R}(1,1) = \overline{R}(2,2) \subsetneq \overline{R}(2,3)$. Thus, stopping
the exploration simply on account of encountering a plateau—even of lengths 
(1,n —1)—is unsound. Intermediate plateaus make our algorithm (unavoidably) 
incomplete: if the test in Line 23 fails, then there are known-to-be-reachable 
abstract states with abstract successors whose reachability cannot be decided at 
that moment. If we knew the plateau to be intermediate, we could keep exploring 
the sets R(r, d) for larger values of r and d until the next plateau emerges, hoping 
that the convergence test succeeds at that time. In general, however, we cannot 
distinguish intermediate from final plateaus. 


5 DrUBA with Unbounded-Domain Variables 


In addition to unbounded control structures like stacks, which come up in pushdown
systems and were discussed in Ex. 5, infinite state spaces in programs are
often due to (nominally) unbounded-domain program variables. This presents 
no problem for the computation of the concrete reachability sets Reached in 
Algorithm 1: for any round and delay bounds (r,d), the set of concrete reachable
states R(r,d) is finite and thus explicitly computable (no symbolic data
structures are needed).² On the other hand, termination of the same algorithm
requires that it eventually reach a plateau in r and d of sufficient length. This is 
guaranteed by an abstraction function a that maps concrete states into a finite 
abstract space. A finite abstract domain is therefore highly desirable. 

A generic abstraction that reduces an unbounded data domain to a finite one 
is predicate abstraction [4,13, see [14] for a short primer]. The goal in this section 
is to demonstrate how the simple scheme of delay-unbounded analysis can be 
combined with predicate abstraction to verify unbounded-thread programs. 


2 Contrast this to a context-switch bound, under which reachability sets can be infinite.


5.1 The Fixed-Thread Case


Consider program P in Fig.1 on the left [8, page 4: program P”]. Intuitively, 
variable m counts the number of threads spawned to execute P concurrently. It 
is easy to see that “the assertion in [P] cannot be violated, no matter how many 
threads execute [P], since no thread but the first will manage to” [8] enter the 
true branch of the if statement and reach the assertion. 


Program P
    shared int m := 0
    shared int s := 0
    local int l := 0
    0: m++
    1: if m = 1 then
    2:     s++, l++
    3:     assert s = l
    4:     goto 2

[Right panel: grid diagram of the exploration path, with ticks 1 to 7 on one axis and 0 to 4 on the other; not recoverable here.]


Fig. 1. Left: program P; Right: how Algorithm 1 operates on (an abstraction of) it


Previous work has shown that even the 1-thread version of this program 
cannot be proved correct using predicate abstraction unless we permit predi- 
cates that depend on both shared and local variables [9, for the unprovability 
result], which have been referred to as mixed [8]. An example is the predicate 
p :: (s = l), which comes up in the assertion. The dependence of p on both shared
and thread-local data causes standard solutions that track the truth value of p in 
a shared or local Boolean variable to be unsound. The solution proposed in [8] 
is to use broadcast instructions to have the executing thread notify all other 
threads whenever the truth value of p changes. This solution comes with two 
disadvantages: (i) the resulting Boolean broadcast programs are more expensive 
to analyze than strictly asynchronous Boolean programs, and (ii) the solution 
cannot be extended to the unbounded-thread case. 

Let us consider how we can verify this program using Algorithm 1, for the 
fixed-thread case; we consider n = 2 threads. We will have to use mixed predi- 
cates as in [8], but since we never execute the abstract Boolean program, there 
is no need for constructing it. As a result, there is no need for broadcast instruc- 
tions. 

The program generates an unbounded number of reachable concrete states, 
but we explore it only under round and delay bounds r and d. As per Algorithm 
1, we increase these bounds until we have reached plateaus of lengths 1 and 
n — 1 = 1, resp. Plateaus are determined over the abstract-state set, so we need 
a function a.

First attempt: a single predicate. We define $a_1$ as follows, for a concrete state c:

$a_1(c) = (c.pc_0,\ c.s = c.l_0) \in \{0,\ldots,4\} \times \{0,1\}$ ,


where $c.pc_0$, $c.l_0$, and $c.s$ are the values of thread 0's pc and local variable l,
and of shared variable s, in state c, respectively. The function extracts from a
concrete state the current program location of thread 0 and the value of predicate
p :: (s = l) for thread 0.³ The only statement not respecting $a_1$ is the if statement
in Line 1: here, the new value of the pc cannot be determined from the current
values of pc and predicate p alone. All other statements respect $a_1$.

We can now perform an iterative exploration of this program—bounded but 
exhaustive within each bound. In Fig.1 on the right, red arrows denote “new 
abstract state reached”. A red horizontal arrow (r++) means: “keep increasing r”. 
A red vertical arrow (d++) means: “switch to increasing r”. In other words, 
following a red arrow—no matter the direction—we always go “right” (r++). The 
green horizontal arrow followed by a green vertical arrow at the end indicates that 
we have reached the first plateaus of length 1 in both directions: at (r, d) = (7,4). 

At this point we have reached a total of 7 abstract states. State (3,0) (pc =
3, s ≠ l) is not among them, so the assertion has not been violated so far.
We run the convergence test, to determine whether set $\overline{R}(7,4)$ is closed under
disrespectful actions. Since the if in Line 1 is the only disrespectful statement, we
only need to check successors of abstract states of the form (1,?) (i.e., with pc =
1). Unfortunately, $\overline{R}(7,4)$ contains abstract state (1,0) (a reachable abstract
state) but not its abstract successor (2,0). This state is unreachable, but we do 
not know that at this point. This causes Algorithm 1 to return “unknown”. 


Second attempt: two predicates. The disrespectful action causing the failure sug- 
gests that we need to keep track of whether the branch in Line 1 can be taken, 
i.e. whether m = 1. We refine our abstraction using this (non-mixed) predicate: 


$a_2(c) = (c.pc_0,\ c.s = c.l_0,\ c.m = 1) \in \{0,\ldots,4\} \times \{0,1\}^2$ . (4)


The abstract successors of the if statement can now be decided based only on
knowledge provided by $a_2$, i.e. the statement respects $a_2$. There is, however,
another statement disrespecting $a_2$, and only one: the increment m++ in Line 0.
If m ≠ 1, we cannot decide whether m = 1 will be true after the increment.
We again perform our iterative exploration of this program, and find the first
suitable plateau at the same point (r,d) = (7,4). This time, however, we have
reached a total of 12 abstract states (all of them “safe”). We run the convergence
test: we only need to check already reached abstract states of the form (0,?,0)
(pc = 0, m ≠ 1). Set $\overline{R}(7,4)$ contains exactly one state of this form: (0,1,0),
which m++ can turn into (1,1,0) and (1,1,1); note that the next pc value is
unambiguous (1), and predicate s = l is not affected. The good news is now
that both abstract states (1,1,0) and (1,1,1) are contained in $\overline{R}(7,4)$. This
proves this set closed under disrespectful actions; Algorithm 1 terminates: the 
assertion is safe for any execution schedule, for the case of n = 2 threads. 
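The ambiguity introduced by m++ under the two-predicate abstraction can be exhibited with two concrete states. The following Python sketch is purely illustrative: a concrete state is reduced to the tuple (pc of thread 0, l of thread 0, s, m), and `alpha2` and `inc_m` are hypothetical names for the abstraction of Eq. (4) and for the statement in Line 0 as executed by thread 0.

```python
def alpha2(c):
    """Abstraction of Eq. (4): (pc of thread 0, s = l, m = 1), predicates as 0/1."""
    pc0, l0, s, m = c
    return (pc0, int(s == l0), int(m == 1))


def inc_m(c):
    """Line 0 executed by thread 0: m++, then continue at Line 1."""
    pc0, l0, s, m = c
    assert pc0 == 0
    return (1, l0, s, m + 1)


# Two concrete states with the same abstraction (0, 1, 0): pc = 0, s = l, m != 1.
c1 = (0, 0, 0, 0)   # m = 0
c2 = (0, 0, 0, 2)   # m = 2

if __name__ == "__main__":
    print(alpha2(c1), alpha2(c2))                  # identical abstractions
    print(alpha2(inc_m(c1)), alpha2(inc_m(c2)))    # different abstract successors
```

The two successors are the abstract states (1,1,1) and (1,1,0) named in the text, so the closure test has to confirm that both are already in the reached set.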


3 Tracking these values for thread 0 suffices: the multi-threaded program is symmetric.

We summarize that, in our solution above, we assumed a lucky hand in picking
predicates; the question of predicate discovery is orthogonal to the delay-
unbounded analysis scheme. However, the proof obtained using Algorithm 1 does 
not involve costly broadcast operations, previously proposed as an ingredient to 
extend predicate abstraction to concurrent programs. A second, more powerful 
advantage is that, unlike the earlier broadcast solution, Algorithm 1 extends 
gracefully to the unbounded-thread case. This is the topic of the rest of this 
section. 


5.2 The Unbounded-Thread Case 


The goal now is to investigate whether an asynchronous unbounded-domain 
variable program is safe for arbitrary thread counts (and thread interleavings). 


Existing solutions. We are aware of only one general technique that combines 
predicate abstraction with unbounded-thread concurrency [15]. That technique 
can achieve the above goal, roughly as follows. In addition to standard and 
mixed predicates used also in the fixed-thread case, we now permit inter-thread 
predicates, which quantify over all threads other than the executing one. Such 
predicates allow us to express, for example, that the value of a thread's local variable l
is larger than that of any other thread: $\forall i : i \neq \mathit{self} : l > l_i$. Predicates of
this type are provably required during predicate abstraction to verify the safety 
of the Ticket Lock algorithm [3,15]. 

Abstraction against inter-thread predicates leads to a dual-reference pro- 
gram [15], a process that is already far more complex than standard sequen- 
tial or even fixed-thread predicate abstraction. But we pay another price for 
using these predicates: namely, the loss of monotonicity of the transition relation
w.r.t. a standard well-quasiordering $\preceq$ on infinite state sets of unbounded-thread
Boolean programs. In this context, monotonicity states, roughly, that
adding passive threads to a valid transition keeps the transition intact.

This price is heavy, since monotonicity w.r.t. $\preceq$ would have given us a
well-quasiordered infinite-state transition system, for which local-state reach- 
ability properties are decidable [1]; working implementations exist. The above- 
mentioned prior work attempts to salvage the situation, by adding a set of tran- 
sitions (the non-monotone fragment) to the dual-reference program that restore 
monotonicity and further overapproximate but without affecting the reachability 
of unsafe states [15]. 


Alternative solution. We now propose a solution that uses the same type of inter- 
thread predicates (this is inevitable), but renders dual-reference programs, the 
monotone closure of the transition relation and all other “overhead” introduced 
in [15] unnecessary. We will use Algorithm 1 as a sub-routine. 

The idea is as follows. Sect. 5.1 suggests a way to verify fixed-thread asyn- 
chronous programs, using a combination of predicate abstraction and Algorithm 
1. To handle the unbounded-thread case, we wrap another layer of incremental
resource bounding around this combined algorithm; the “resource” this time is
the number n of threads executing the program. For each member of a sequence
of increasing fixed thread counts we compute the set of abstract states reachable 
under arbitrary thread interleavings. This is purely a sub-routine; we will use 
the method proposed in Sect. 5.1 (others are possible, e.g. [8]). 

The incremental (in n) analysis proceeds until we have reached a thread 
plateau of length 1, and then run the convergence test: we check the current 
abstract reachability set for closure under disrespectful actions. This time, the 
abstract transitions must take into account that the number of executing threads 
is unknown. It is easy to see that a plateau of length 1 is sufficient: we compute 
the set of abstract states reachable under arbitrary thread schedules; thus, the 
obstacle of non-schedulability of thread i in the proof of Theorem 7 that forced 
us to wait for a (delay) plateau of length n — 1 does not apply here. 
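The outer loop over thread counts can be sketched in a few lines. This is a schematic illustration only: `abs_reach_for(n)` stands for whatever sub-routine computes the abstract reachability set for n threads under arbitrary interleavings (for instance Algorithm 1), `closed` stands for the closure test against thread-count-independent abstract transitions, and the cap `n_max` as well as the toy data are our own additions.

```python
def thread_plateau(abs_reach_for, closed, n_start=2, n_max=16):
    """Increase the thread count until the abstract reachability set is
    unchanged for one step (plateau of length 1), then test closure."""
    previous = abs_reach_for(n_start)
    for n in range(n_start + 1, n_max + 1):
        current = abs_reach_for(n)
        if current == previous:
            return ("safe" if closed(current) else "unknown"), n
        previous = current
    return "unknown", n_max   # no plateau found within the (hypothetical) cap


if __name__ == "__main__":
    # Hypothetical abstract sets per thread count, for illustration only.
    toy = {2: {"a"}, 3: {"a", "b"}, 4: {"a", "b"}}
    print(thread_plateau(toy.__getitem__, closed=lambda A: True))
```

A property violation would be raised by the inner sub-routine itself, which is why this loop only distinguishes “safe” from “unknown”.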


A non-monotone resource parameterization 

Before we demonstrate this idea on program P, we justify our strategy of combin- 
ing resource bounds. The idea presented above can be viewed as a multi-resource 
analysis problem where we increment r and d in an “inner loop” (represented 
by Algorithm 1 as a sub-routine to compute fixed-thread reachability sets), and 
n in an outer loop. Both loops compute monotonically increasing reachability
sequences: for “inner” this is Prop. 3; for “outer” this is easy to see. Theorem 7
relies upon the monotonicity: without it, the test $\overline{R}(r,d) = \overline{R}(r+1,\ d+n-1)$
makes the algorithm unsound.

The way we nest the three involved resource parameters is not arbitrary: 
Round-Robin reachability under an increasing thread count is not monotone. 
More precisely, making the thread-count parameter n explicit, let R(r,d,n) 
denote the set of states reachable in the n-thread program P under RR(r, d) 
scheduling. Then $R(r,d,n) \subseteq R(r,d,n+1)$ is not valid. The following example
illustrates this (at first counter-intuitive) monotonicity violation: 


Example 12. Consider the following asynchronous Boolean program over shared
variables s and t:

    shared bool s := 0, t := 0
    0: t := !t
    1: if t then
    2:     s := 1

Here we have $R(3,0,1) \not\subseteq R(3,0,2)$: given 1 thread (sequential execution), a state
with s = 1 is reachable. With 2 symmetric threads, under delay-free Round-Robin
scheduling (d = 0), the first and second thread will repeatedly flip t to 1 and back
to 0, resp., before either one has a chance to get past the guard in Line 1.

A stronger result is: for all $r \in \mathbb{N}$, $R(3,0,1) \not\subseteq R(r,0,2)$, i.e. we cannot make
up for the poor scheduling of the second thread by adding more rounds.
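Because delay-free Round-Robin scheduling is deterministic for this program, the claim can be checked by direct simulation. The sketch below assumes that each thread executes one statement per turn, that a false guard in Line 1 ends the thread, and that finished threads are skipped; `s_reachable` is a hypothetical helper name.

```python
def s_reachable(r, n):
    """Simulate r delay-free Round-Robin rounds of n symmetric threads and
    report whether a state with s = 1 is reached (s is never reset)."""
    s, t = 0, 0
    pc = [0] * n                       # per-thread program counter; 3 = finished
    for _ in range(r):
        for i in range(n):
            if pc[i] == 0:             # 0: t := !t
                t = 1 - t
                pc[i] = 1
            elif pc[i] == 1:           # 1: if t then ... (false guard: finish)
                pc[i] = 2 if t else 3
            elif pc[i] == 2:           # 2: s := 1
                s = 1
                pc[i] = 3
    return s == 1


if __name__ == "__main__":
    print("n = 1, r = 3:", s_reachable(3, 1))
    print("n = 2, r = 1..10:", [s_reachable(r, 2) for r in range(1, 11)])
```

With one thread, three rounds suffice to set s; with two threads the second thread undoes the first thread's flip of t in every run, regardless of the number of rounds tried.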
The consequence for us is that we cannot compute, for fixed r, d, the sets
$R(r,d,\infty)$, using the closure-under-disrespectful-actions paradigm. Instead we
must, for each n, compute $R(\infty,\infty,n)$ (using Algorithm 1 or otherwise) and
increase n in the outer loop.

Verifying program P for unbounded thread count


We recall that, given the two predicates shown in Eq. (4) and the pc, we were 
able to verify program P correct (under arbitrary thread interleavings) for n = 2
threads; a total of 12 abstract states were reached (out of 5 · 2² = 20 possible).
Advancing the outer loop, we invoke Algorithm 1 for n = 3 threads. This reveals
another reachable abstract state, namely pc = 0, s ≠ l, m ≠ 1. Unfortunately,
this state causes Algorithm 1 to return “unknown”: under $a_2$, one currently
unreached abstract successor is pc = 1, s ≠ l, m = 1, violating closure. Observing
that a thread executing Line 1 with m = 1 must be the first thread executing,
we try tracking the initial value of m:


$a_3(c) = (c.pc_0,\ c.s = c.l_0,\ c.m = 1,\ c.m = 0) \in \{0,\ldots,4\} \times \{0,1\}^3$ . (5)


Interestingly, all actions (statements) of program P respect abstraction $a_3$. This
means that the test for closure under disrespectful actions is vacuously true—we 
can stop as soon as we have reached a plateau in n of length 1. We don’t have 
to wait long for this plateau: we invoke Algorithm 1 for n = 3 and n = 4 under 
abstraction a3. (Note that n = 4 requires a longer plateau than n = 3.) The 
abstract reachability sets consist of the same 14 abstract states in both cases. We 
report the program safe, for arbitrary interleavings and arbitrary thread counts. 
We can also report the exact set of 14 reachable abstract states. 
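Why the extra predicate m = 0 removes the ambiguity of m++ can be checked on a small sample. The sketch below is illustrative and covers only the statement in Line 0, not the whole program; it reduces a concrete state to (pc of thread 0, l of thread 0, s, m), assumes m is non-negative (it counts threads), and uses hypothetical names throughout. A finite sample can refute respectfulness but does not prove it.

```python
from itertools import product


def alpha2(c):
    """Eq. (4): (pc of thread 0, s = l, m = 1)."""
    pc0, l0, s, m = c
    return (pc0, int(s == l0), int(m == 1))


def alpha3(c):
    """Eq. (5): additionally track whether m still has its initial value 0."""
    pc0, l0, s, m = c
    return (pc0, int(s == l0), int(m == 1), int(m == 0))


def inc_m(c):
    """Line 0 executed by thread 0: m++, then continue at Line 1."""
    pc0, l0, s, m = c
    return (1, l0, s, m + 1)


def respects_on(alpha, action, states):
    """True if, on the sample, equal abstractions always lead to equal
    abstract successors."""
    succ = {}
    for c in states:
        succ.setdefault(alpha(c), set()).add(alpha(action(c)))
    return all(len(v) == 1 for v in succ.values())


# Sample: thread 0 at Line 0, l and s in {0, 1}, m in {0, ..., 5}.
SAMPLE = [(0, l0, s, m) for l0, s, m in product(range(2), range(2), range(6))]

if __name__ == "__main__":
    print("m++ respects a_2 on sample:", respects_on(alpha2, inc_m, SAMPLE))
    print("m++ respects a_3 on sample:", respects_on(alpha3, inc_m, SAMPLE))
```

Under the three-predicate abstraction, m = 0 always leads to m = 1, while m = 1 and m ≥ 2 both lead to a state satisfying neither predicate, so each abstract source has a single abstract successor.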

We again summarize that, while we still (and unavoidably) use mixed pred- 
icates, we do not construct a thread-parameterized abstract program, which 
would require broadcast statements [8] and a rather involved dual-reference tran- 
sition semantics [15]. In fact, we did not even need to test for closure under any 
abstract images, since the chosen abstraction enjoys respect from all actions. 


6 Evaluation 


Our goal for the evaluation of DrUBA was to answer the following questions: 


1. How does DrUBA compare to abstract fixed-point computation (“AI”)?
2. How does DrUBA compare to the approach from [18] (“CUBA”)?
3. How expensive is the state exploration along a plateau in Algorithm 1?
4. What is the performance benefit of the frontier optimization in Algorithm 1?


Questions 1 and 2 serve to compare DrUBA against other techniques; Questions 
3 and 4 investigate features of Algorithm 1. 

To this end we implemented, in Java 11, a verifier using Algorithm 1 that 
takes concurrent pushdown systems as input; we refer to this verifier as DrUBA 
in this section.⁴ We also implemented the AI approach in Java 11. For the
comparison with the context-unbounded approach, we used a publicly available
tool.⁵ Our experiments are based on the concurrent benchmark programs also
used in [18]. The experiments are performed on a 3.20 GHz Intel i5 PC. The
memory limit was 8 GB, with a timeout of 1 h.


4 DrUBA implementation available at https://doi.org/10.5281/zenodo.4726301.
5 https://github.com/Ipzun/cuba.


6.1 Results


Table 2 reports the benchmark names, the thread counts, and the size of the
reachable abstract state space (columns 1-3). The second part of the table shows 
the time it took each verifier to fully explore the state space and confirm conver- 
gence. For the AI approach, we check whether the abstract state space is closed 
under all operations each time either r or d is incremented. Algorithm 1 was 
faster than “AI” on every example except Stefan-4,5. Stefan is the only pro- 
gram that actually does not require any delays to discover all reachable abstract 
states. The results indicate that the AI approach spends approximately half of its 
computation time doing repeated convergence tests after each bound increment. 
Furthermore, as state sets increase in size, AI seems to take even longer, as with 
the Bluetooth3 (2+3) example. The convergence test needed for “AI” includes 
checking closure under both respectful and disrespectful actions, making it more 
costly than the one used in Algorithm 1. 

Algorithm 1 also improved on the results with “CUBA”. For examples that 
took longer than a few seconds, DrUBA was able to run in less time on the 
same benchmark. The difference on small examples is likely due to a different 
implementation language (C++ vs. Java). DrUBA does not explore as many 
schedules, and explores fewer as the delay and round bounds approach their 
cutoff values (as noted below). Additionally, DrUBA was less memory-intensive 
for large examples for which the CUBA approach cannot prove that the set of 
reachable states per context bound is finite. In this case, “CUBA” requires the 
use of more expensive symbolic representations of state sets. Algorithm 1 does
not suffer from this problem—the reachability sets in each iteration are finite. 
For the Stefan-5 example, “CUBA” ran out of memory after 23min. DrUBA 
was able to prove convergence for this example (as was “AI”).

Table 3 reports the number of times Algorithm 1 computed the image (suc- 
cessors) of a state until reaching the final r-d-plateau (Col. 3) and during the final 
plateau (Col. 4), as well as the total number of image computations without the 
frontier optimization (Col. 5). The table offers convincing evidence to support 
our heuristic that waiting for a long d-plateau at the end of exploration is not 
costly, answering Question 3. On most benchmarks, the amount of computation
done during the plateau (Col. 4) was negligible. This included our largest exam- 
ple, Bluetooth3 (2+3). The exception to this is the Stefan examples, which—as 
mentioned earlier—do not require any delays to reach the full abstract state set 
(the d-plateau starts at (r_max, 0)). Finally, a naive implementation that does not
take advantage of monotonicity, forgoing the frontier approach to expanding the 
state set, was orders of magnitude worse. This is because it has to recompute 
the whole set for every iteration of r or d. This answers Question 4.
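
The effect of the frontier optimization can be illustrated on a toy example. The Python sketch below is not Algorithm 1: the transition relation `successors`, the use of a single depth bound in place of the pair (r, d), and all function names are assumptions made for illustration. It only shows why restarting from the initial states for every bound multiplies the number of image computations, whereas expanding only newly discovered states computes each image at most once.

```python
def successors(state, modulus=50):
    """Toy image operation on integers (hypothetical; stands in for one program step)."""
    return {(state + 1) % modulus, (2 * state) % modulus}


def explore_naive(init, max_bound):
    """Recompute the bounded reachability set from `init` for every bound k."""
    calls = 0
    reach = set(init)
    for k in range(1, max_bound + 1):
        reach, layer = set(init), set(init)
        for _ in range(k):
            image = set()
            for s in layer:
                calls += 1              # one image computation
                image |= successors(s)
            layer = image - reach       # states first seen at this depth
            reach |= layer
    return reach, calls


def explore_frontier(init, max_bound):
    """Reuse the previous set: expand only states discovered in the last step."""
    calls = 0
    reach, frontier = set(init), set(init)
    for _ in range(max_bound):
        image = set()
        for s in frontier:
            calls += 1
            image |= successors(s)
        frontier = image - reach
        reach |= frontier
    return reach, calls


naive_reach, naive_calls = explore_naive({1}, 60)
front_reach, front_calls = explore_frontier({1}, 60)
print(len(front_reach), front_calls, naive_calls)
```

Both functions return the same set; in the frontier variant the number of image computations is bounded by the number of reachable states, which mirrors the gap between Col. 3 and Col. 5 of Table 3 in spirit only.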

Comparing Col. 7 in Table 2 to the cutoff context-switch bounds from [18],
we find that, while the r and d bounds were large, not all programs that needed 
large bounds took a long time to verify. For example, the Bluetooth3 (2+1) 
example took much less time than Stefan-5, despite requiring 21 more delays 
(with similar rounds). A hint for the reason can be found in Table 3. Once the
set of abstract states is close to the full set R, there are very few new states on the
Table 2. Benchmark description and running times for different algorithms. Threads:
# of threads (a + b: the respective numbers of threads from two different templates);
|R|: number of reachable abstract states; Time: running time (sec) for each algorithm
("-": timeout or memory-out, also shown as TO or OOM); r_max, d_max: round and delay
counts at the end of each plateau when convergence was detected.


Column: 1        | 2       | 3      | 4           | 5       | 6       | 7
Benchmark        | Threads | |R|    | Time:       | Time:   | Time:   | r_max, d_max
                 |         |        | Algorithm 1 | “AI”    | “CUBA”  | (Algorithm 1)
-----------------+---------+--------+-------------+---------+---------+--------------
1 Bluetooth1     | 1+1     | 1010   | .69         | .92     | .32     | 23, 15
                 | 1+2     | 5468   | 3.32        | 6.79    | 2.25    | 32, 29
                 | 2+1     | 18972  | 8.52        | 16.69   | 13.60   | 35, 26
2 Bluetooth2     | 1+1     | 1018   | .71         | .98     | .29     | 23, 15
                 | 1+2     | 5468   | 3.60        | 6.68    | 2.62    | 32, 29
                 | 2+1     | 18972  | 8.81        | 16.68   | 13.97   | 35, 26
3 Bluetooth3     | 1+1     | 1018   | [illegible] | 1.23    | .41     | 23, 15
                 | 1+2     | 5468   | 3.61        | 6.40    | 2.79    | 32, 29
                 | 2+1     | 19002  | 9.97        | 16.27   | 14.50   | 35, 26
                 | 2+2     | 94335  | 70.71       | 136.31  | 343.05  | 44, 40
                 | 2+3     | 460684 | 654.47      | 2084.76 | TO      | 56, 56
4 BST-Insert     | 1+1     | 272    | .36         | .49     | .14     | 31, 16
                 | 2+1     | 6644   | 3.62        | 5.25    | 10.09   | 49, 32
                 | 2+2     | 14256  | 8.12        | 14.87   | 99.94   | 50, 38
5 Filecrawler    | 1+2     | 246    | .37         | .54     | .05     | 20, 12
6 K-Induction    | 1+1     | 130    | .51         | .74     | .48     | 20, 09
7 Proc-2         | 2+2     | 352    | .56         | .77     | 2.05    | 19, 20
8 Stefan         | 2       | 31     | .24         | .36     | .04     | 13, 02
                 | 4       | 687    | 13.99       | 13.82   | 20.33   | 32, 04
                 | 5       | 3085   | 428.22      | 295.02  | OOM     | 35, 05
                 | 8       | -      | OOM         | OOM     | OOM     | -
9 Dekker         | 2       | 1507   | .82         | 1.62    | .39     | 37, 16


frontier. We can see this in the small numbers in Col. 4, but it also applies to 
the round bound. If a state is rediscovered, it is not expanded in further round 
increments. Once the round bound is large enough, there are few deep schedules 
of maximum possible length (nr) that produce new concrete states. 


6.2 Unbounded-Thread Experiments 


We implemented Algorithm 1 in combination with predicate abstraction as 
detailed in Sect. 5.2 to check the effectiveness of our technique on a tricky con- 
current program that requires unbounded variable domains. The Ticket Lock 
Table 3. Detailed analysis of Algorithm 1, measuring the number of times the program
computed the successors of a state. Col. 3 reports the image operations Algorithm 1
performed before reaching the FP (final plateau); Col. 4 reports the number of additional
image operations computed until the program ended. Col. 5 shows the image operations
without the frontier improvement, requiring recomputing each R(r, d) from the initial
states ("-": no result).


Column: 1        | 2       | 3                 | 4                   | 5
Benchmark        | Threads | Calls to Image:   | Calls to Image:     | Calls to Image:
                 |         | begin to begin FP | begin FP to end FP  | w/o frontier
-----------------+---------+-------------------+---------------------+----------------
1 Bluetooth1     | 1+1     | 4,034             | 1                   | 339,261
                 | 1+2     | 23,441            | 3                   | 4,758,084
                 | 2+1     | 80,283            | 19                  | 23,199,458
2 Bluetooth2     | 1+1     | 4,103             | 1                   | 350,587
                 | 1+2     | 23,493            | 3                   | 4,780,778
                 | 2+1     | 80,714            | 19                  | 23,290,556
3 Bluetooth3     | 1+1     | 4,096             | 8                   | 348,851
                 | 1+2     | 23,493            | 3                   | 4,786,950
                 | 2+1     | 80,834            | 19                  | 23,467,470
                 | 2+2     | 478,426           | 2                   | 283,910,446
                 | 2+3     | 2,766,625         | 6                   | -
4 BST-Insert     | 1+1     | 780               | 1                   | 82,130
                 | 2+1     | 29,802            | 6                   | 17,785,065
                 | 2+2     | 62,190            | 25                  | 34,335,106
5 Filecrawler    | 1+2     | 1,056             | 4                   | 202,074
6 K-Induction    | 1+1     | 5,636             | 974                 | 218,715
7 Proc-2         | 2+2     | 2,501             | 1,298               | 578,099
8 Stefan         | 2       | 367               | 59                  | 500,494
                 | 4       | 658,696           | 261,881             | -
                 | 5       | 10,299,293        | 6,621,157           | -
                 | 8       | -                 | -                   | -
9 Dekker         | 2       | 3,636             | 2                   | 688,836


protocol [3] and the predicates used to prove its correctness are shown in Fig. 2. 
In Line 0, threads wait to enter the Critical Section, whose code is at the begin- 
ning of Line 1; the rest of Line 1 is exit code to prepare the thread for re-entry. In 
the predicates on the right, subscript i denotes thread i’s copy of a local variable.

This example has been shown to require significant adjustments to predicate 
abstraction to accommodate fixed-thread concurrency [8], and has been claimed 
to require an entirely new theory to cope with the unbounded-thread case [15]. 
We rely on the same predicates used in earlier work, and it is clear what motivates 
each predicate. P1 ensures t is a “new” ticket larger than previous ones, P2 is
shared int s := 0, t := 0
local int l := fetch_and_add(t)                P1: ∀i: t > l_i
0: while s ≠ l do ;    ▷ wait for s = l
                                               P2: |{i : pc_i = 1}| ≥ 2
1: critical-section code here
   inc(s)                                      P3: s = l
   l := fetch_and_add(t)
   goto 0                                      P4: ∀i: i ≠ self : l ≠ l_i


Fig. 2. Left: the Ticket Lock protocol; Right: four predicates used to prove it correct


used to check the safety property, P3 tracks the condition in Line 0, and P4 
means that the operating thread’s l is unique. DrUBA finds four abstract states 
for both 2 threads and 3 threads using Algorithm 1. This is an n-plateau of 
length 1. 

To prove convergence for both Algorithm 1 and the “outer loop” increment- 
ing n, we used the ACL2s theorem prover [7]. We specified the data in a concrete 
state, and the four abstract states that were found. Only the second statement 
disrespects this abstraction w.r.t. r,d and n, as we know the value of the test in 
the first statement for an abstract state. Given these, ACL2s was able to verify 
that the set of abstract states is closed under the semantics of statement 1. As 
a result, we can report that Ticket Lock is safe (P2 is invariantly false), for an 
arbitrary number of threads and arbitrary thread interleavings. 
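
For intuition, the Ticket Lock of Fig. 2 can also be explored explicitly for a fixed number of threads and a bounded ticket counter. The Python sketch below is neither the DrUBA implementation nor the ACL2s proof; the pc encoding (-1 for the initial ticket draw), the bound `max_ticket`, and all function names are assumptions made for illustration. It performs bounded testing only and therefore says nothing about unbounded executions or an arbitrary number of threads.

```python
from collections import deque


def step(state, i):
    """Successor of `state` when thread i runs its next statement, or None if blocked."""
    s, t, tickets, pcs = state
    tickets, pcs = list(tickets), list(pcs)
    if pcs[i] == -1:                      # initial draw: l := fetch_and_add(t)
        tickets[i], t, pcs[i] = t, t + 1, 0
    elif pcs[i] == 0:                     # line 0: wait until s = l
        if s != tickets[i]:
            return None
        pcs[i] = 1
    else:                                 # line 1: leave, release, draw a new ticket
        s += 1
        tickets[i], t, pcs[i] = t, t + 1, 0
    return (s, t, tuple(tickets), tuple(pcs))


def explore(num_threads, max_ticket):
    """All states reachable under any interleaving while t stays below max_ticket."""
    init = (0, 0, (-1,) * num_threads, (-1,) * num_threads)
    seen, work = {init}, deque([init])
    while work:
        state = work.popleft()
        if state[1] >= max_ticket:        # cut off the unbounded counter t
            continue
        for i in range(num_threads):
            nxt = step(state, i)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen


def violates_mutex(state):
    """P2 of Fig. 2: at least two threads are at line 1."""
    return sum(pc == 1 for pc in state[3]) >= 2


for n in (2, 3):
    states = explore(n, max_ticket=8)
    print(n, len(states), any(violates_mutex(st) for st in states))
```

Within the explored bound no state satisfies P2, which is consistent with the safety result reported above.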


7 Discussion of Related Work 


This work is inspired from two angles. The first is clearly the delay-bounded 
scheduling (DBS) technique [10]. The authors formalize this concept and show its 
effectiveness as a testing scheme. Their computational model of a dynamic task 
buffer is somewhat different from ours. We have not discussed dynamic thread 
creation here; it can be simulated by creating threads up-front and delaying them 
until such time as they are supposed to come into existence. The DBS paper also 
presents a sequentialization technique that can be turned into a symbolic verifier 
via verification-condition generation and SMT solving. This, however, requires 
bounding loops and recursion. Our approach combines exhaustive finite-state 
model exploration with convergence detection and thus does not suffer from 
these restrictions. 

The second inspiration comes from an earlier context-unbounded analysis 
technique [18]. Similar in spirit to the present work, [18] started from a yet ear- 
lier context-bounded analysis technique and describes a condition under which 
a chosen context bound is sufficient to reach all states reachable under some 
abstraction. For the case of concurrent pushdown systems (CPDS)—the verifi- 
cation target of [18]—, the pop operation plays a crucial role in establishing this 
condition; note that, in our work, pop actions disrespect the top-of-the-stack 
abstraction commonly used for CPDS. 
Our work has a number of advantages over [18]. First, and crucially, the
set of states reachable under a context bound can be infinite (a single context 
can already generate infinitely many states); its determination thus requires 
more expensive symbolic reachability methods. In contrast, the reachability set 
under Round-Robin scheduling with a round- and a delay bound is always finite; 
moreover, it can be computed very easily, even for complex programs. This makes 
our technique a prime choice for lifting existing testing schemes to verifiers. 
A second advantage over [18] is that we retain much of the efficiency of the 
“almost deterministic” exploration of delay-bounded scheduling, as demonstrated
in Sect. 6. A downside of our work is that our convergence condition is sound only 
after a plateau has emerged of length roughly equal to the number of running 
threads; this is not required in [18]. However, as also demonstrated in Sect. 6, 
our efforts to compute reachable states for increasing r, d in a frontier-driven 
way nearly annihilate this drawback: in most cases, only a small number of
image computations happen along the plateau. 

An alternative to our verification approach is a classical analysis based on 
abstract interpretation [6]. Given an abstraction function α, such an analysis interprets the entire
program abstractly, and then computes a fixed point under the abstract pro- 
gram’s transition relation. This fixed point, if it exists, overapproximates the 
set of reachable abstract states. Hence, the absence of error states in the fixed 
point implies safety, but the presence of errors does not immediately permit 
a conclusion. In contrast, our technique interleaves concrete state space explo- 
ration (enabling genuine testing) with abstraction-based convergence detection. 
We believe this to be a useful approach in practical programming environments, 
where abstract proof engines with poorly understood bug-finding capabilities 
may be met with skepticism. A more detailed discussion of DrUBA vs. Abstract 
Interpretation can be found in the Appendix of [14]. 

Underapproximating program behaviors using bounding techniques is a wide- 
spread solution to address undecidability of safety verification problems. Exam- 
ples include depth- [12] and context-bounding [16,17,20], delay-bounding [10], 
bounded asynchrony [11], preemption-bounding [19], and phase-bounded anal- 
ysis [2,5]. Many of these bounding techniques admit decidable analysis prob- 
lems [16,17,20] and thus have been successfully used in practice for bug find- 
ing. Round- and delay-bounded Round-Robin scheduling trivially renders safety 
decidable, since the delay-program is finite-state. In addition, it is very easy to 
implement, avoiding, for example, the need for symbolic data structures and 
algorithms to represent and process intermediate reachability sets. 


8 Conclusion 


We have presented an approach to enhancing delay-bounded scheduling in asyn- 
chronous programs with a convergence test that, if successful, certifies that all 
states from some chosen abstract domain have been reached. The resulting algo- 
rithm inherits from earlier work the capability to detect bugs efficiently, but can 
also prove safety properties, under arbitrary thread interleavings. It exploits the 
monotonicity of delay-bounded reachability sets to expand states and test for
convergence only when needed. We have further demonstrated that, combined 
with predicate abstraction using powerful predicates, tricky unbounded-thread 
routines over unbounded data, such as the Ticket Lock, can be verified using 
substantially less machinery than proposed in earlier work. We have shown the 
experimental competitiveness of our approach against several related techniques. 
Abstract. GPUs offer parallelism as a commodity, but they are difficult
to program correctly. Static analyzers that guarantee data-race freedom
(DRF) are essential to help programmers establish the correctness
of their programs (kernels). However, existing approaches produce too 
many false alarms and struggle to handle larger programs. To address 
these limitations we formalize a novel compositional analysis for DRF, 
based on access memory protocols. These protocols are behavioral types 
that codify the way threads interact over shared memory. 

Our work includes fully mechanized proofs of our theoretical results, 
the first mechanized proofs in the field of DRF analysis for GPU kernels. 
Our theory is implemented in Faial, a tool that outperforms the state-of-the-art.
Notably, it can correctly verify at least 1.42× more real-world
kernels, and it exhibits a linear growth in 4 out of 5 experiments, while 
others grow exponentially in all 5 experiments. 


Keywords: GPU · Data-race · Static analysis · Behavioural types


1 Introduction 


GPUs are massively parallel devices that promise a great return on investment 
at a cost: they are notably difficult to program. In GPU programming, hundreds 
of lightweight threads share portions of arrays in parallel (without locks)—very 
different from the programming model of multithreaded programs written in 
C or Java with heavy-weight heterogeneous threads. Data-race freedom (DRF) 
analysis aims to guarantee that for all possible executions, every array cell being 
written by one thread cannot be concurrently accessed by another thread. 

In the field of static analysis of DRF in GPU programs, there is a tension 
between efficiency and correctness (no missed data-races and no false alarms) 
that thus far is unresolved. Bug finding tools [26,27,33] favor correctness over 
efficiency: they provide correct results at small scales, by simulating the program 
execution. Such tools are incapable of handling certain parameters symbolically 
(e.g., array size) and can easily exhaust users’ resources (e.g., loops with long 
iteration spaces or unknown bounds). Approaches based on Hoare logic [5,7, 22] 
© The Author(s) 2021 


A. Silva and K. R. M. Leino (Eds.) CAV 2021, LNCS 12759, pp. 403-426, 2021. 
https://doi.org/10.1007/978-3-030-81685-8_19


404 T. Cogumbreiro et al. 


Inference (§5) → Well-formed check (§3) → Barrier aligning (§4.1) → Barrier splitting (§4.2) → Quantification (§5) → SMT backend (§5)


Fig. 1. Work-flow of the verification. 


can cope with medium-sized programs, do not miss data-races, and do not require 
array size information; however, they suffer from a high rate of false alarms and
require code annotations written by concurrency experts. Finally, tools that can 
cope with larger programs and do not require array size information either miss 
data-races [24] or overwhelm the user with false alarms [37]. 

To appease this tension, we introduce a novel static DRF analysis that can 
handle larger programs and produce fewer false alarms than related work, with- 
out missing data-races. Additionally our analysis does not require code anno- 
tations or array size information. Our verification framework hinges on access 
memory protocols, a new family of behavioral types [1] that codify the way 
threads interact through shared memory. Our behavioral types also make evi- 
dent two aspects of the analysis that can be made separate: concurrency analysis 
(i.e., could these two expressions run in parallel?) and data-race conflict detec- 
tion (i.e., do these array indices match?). 


Contributions and Synopsis. This paper includes the following contributions. 


(1) In Sect. 3, we formalize the syntax, semantics, and well-formedness con- 
ditions for access memory protocols, which are behavioral types for GPU 
programs. This behavioral abstraction results in a simpler yet more expres- 
sive theory than previous works, e.g., it does not require user-provided loop 
invariants. 

(2) In Sect. 4, we show that our DRF analysis of access memory protocols can 
be soundly and completely reduced to the satisfiability of an SMT formula, 
see Theorems 1 and 3. Our theory and results on access memory protocols 
are fully mechanized in Coq. To the best of our knowledge, this is the first 
mechanized proof of correctness of a DRF analysis for GPU programs. 

(3) We show that our DRF analysis of access memory protocols is compositional 
when protocols satisfy a structural property, see Theorem 2. Additionally, 
we show how to transform protocols when they do not meet this property. 

(4) In Sect. 5 we present Faial, which infers access memory protocols from CUDA 
kernels and implements our theory. Our experiments show that Faial is more 
precise and scales better than existing tools. 

(5) In Sect. 6, we present a thorough experimental evaluation of Faial against
related work [5,24,26,27], the largest comparative study of GPU verifica- 
tion (5 tools in 260 kernels, 3 tools compared in 487 kernels). Faial verified 
218 out of 227 real-world kernels (at least 1.42× more than other tools)
and correctly verified more handcrafted tests than other tools (4 out of 5). 
In a synthetic benchmark suite (250 kernels), Faial is the only tool to exhibit 
linear growth in 4 out of 5 experiments, while others grow exponentially in 
all 5 experiments. 
Listing 2.1 Examples of racy kernels; l.h.s. is from [34] and r.h.s. simplifies l.h.s.
for clarity, with one-dimensional array and thread identifier, and 1-stride loops.


(l.h.s.)
1 for (int r = 0; r < N; r++) {
2   for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
3   { tile[tid.y+i][tid.x] = idata[index_in+i*width]; }
4   __syncthreads();
5   for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
6   { odata[index_out+j*height] = tile[tid.x][tid.y+j]; } }

(r.h.s.)
1 for (int r = 0; r < N; r++) {
2   for (int i = 0; i < M; i++)
3   { tile[tid] = ...; }
4   __syncthreads();
5   for (int j = 0; j < M; j++)
6   { ... = tile[tid+j]; } }


Our paper is accompanied by an implementation (Faial), an evaluation frame- 
work (inc. datasets), and proof scripts (in Coq) for each theorem. All of these 
are available in our artifact [9]. 


2 Overview 


This section gives an overview of our approach by examining a data-race we 
found in published work [17,34]. We discuss the challenges that such examples 
pose to the state-of-the-art of DRF analysis. Then we introduce a verification 
framework based on access memory protocols: behavioral types [1] that codify 
the way threads interact via shared memory. Figure 1 gives an overview of the
verification pipeline. We start from CUDA kernels, from which we infer access 
memory protocols. Protocols are then checked for well-formedness and trans- 
formed in three steps into formulas that are verified by an SMT solver. 


2.1 Challenges of GPU Programming 


GPU Programming Model. The key component of GPU programming is 
the kernel program, or just kernel, that runs according to the Single-Instruction- 
Multiple-Thread (SIMT) execution model, where multiple threads run a single 
instruction concurrently. A kernel is parameterized by a special variable that 
holds a thread identifier, henceforth named tid. In parallel, each member of a 
group of threads runs an instantiated copy of the kernel by supplying its identifier 
as an argument. Threads communicate via shared memory (arrays) and mediate 
communication via barrier synchronization (an execution point where all threads 
must wait for each other before advancing further). Writes are only guaranteed 
to be visible to other threads after a barrier synchronization. 

GPU programming platforms usually group threads hierarchically in multiple
levels, across which no inter-group synchronization is possible. In both the
literature [6,24] and this work, the focus is on intra-group communication. 


Challenges. We motivate the difficulty of analyzing data-races by studying a 
programming error found in the wild, reported in Listing 2.1 (left). This excerpt 
comes from a tutorial [34] on optimizing numeric algorithms for GPUs. The code 
listing transposes a matrix N times with an outer loop indexed by variable r.
Remarkably, the tutorial [34] does not inform the readers that Listing 2.1 
contains a subtle data-race: one transpose-operation starts (the writes to tile 
Listing 2.2 Minimal representative example of an access memory protocol highlighting
the data-race in Listing 2.1.


1 // r = 0
2 for^U j in 0..M     // for (int j = 0; j < M; j++)
3   { rd[tid+j] };    //   _ = tile[tid+j];
4 // r = 1
5 for^U i in 0..M     // for (int i = 0; i < M; i++)
6   { wr[tid] }       //   tile[tid] = _;


in line 3) without awaiting the termination of the previous transpose-operation 
(the reads from tile in line 6), thus corrupting the data over time and possibly 
skewing the timing of the optimization to appear faster than it should be. We 
found a related data-race in [17], which reuses code from [34]. 

Our tool, Faial, successfully identifies the program state that triggers the 
data-race in Listing 2.1: when r=1 and N=2. However, state-of-the-art tools struggle
to accurately analyze Listing 2.1, as evaluated in Sect. 6 (Claim 1: Test 1).
Symbolic execution tools, such as [26,27], time out for N>1 and, in general, cannot
handle symbolic (unknown) bounds. GPUVerify [6], a tool based on Hoare
logic, reports a false alarm: a spurious data-race when r=0 and N=1. PUG [24] 
incorrectly identifies the example as DRF, as its analysis appears to ignore data- 
races originating from different iterations of a loop. 


2.2 Memory Access Protocols by Example 


We now investigate the data-race in Listing 2.1 with an access memory protocol.
For presentation purposes, we focus our discussion on Listing 2.1 (r.h.s.),
that simplifies the l.h.s. whilst retaining the root cause of its data-race, which 
stems from the interaction between both loops. We discuss how we support 
multi-dimensional arrays, multi-dimensional thread identifiers, and arbitrary 
loop strides in Sect. 5. In our Coq formalism the notion of “accesses” (and their 
dimensions) is a parameter of the theory, thus orthogonal to the theory presented 
here. 

Consider the execution of the end of the first iteration (r=0) and the beginning 
of the second (r=1) iteration of the outer-loop. In this case, the execution of the 
j-loop when r=0 is not synchronized with the execution of the i-loop when r=1 as 
there is no call to __syncthreads() in between. 

The access memory protocol in Listing 2.2 captures this partial execution
from the viewpoint of array tile. By design, access memory protocols overapproximate
kernels by abstracting away what data is being written to/read from an
array, to focus on where data is being accessed. The protocol models the two problematic
loops of Listing 2.1, i.e., the j-loop when r=0 and the i-loop when r=1.
The first loop reads (rd[tid+j]) from the array, while the second writes (wr[tid])
to it. Evaluation of a protocol follows the SIMT model: each thread evaluates
wr[tid] by instantiating tid with its unique identifier (hereafter, an integer),
e.g., thread 0 yields wr[0] and thread 1 yields wr[1].
Analysis of Unsynchronized Protocols. We say that a protocol is DRF when
all concurrent accesses are pair-wise DRF, i.e., whenever two accesses are issued by
different threads on the same index, neither access is a write. For instance, the respective
sets of concurrent accesses of threads 0 and 1 in Listing 2.2 are given below:


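$$
\underbrace{\{\mathtt{rd}[j] \mid 0 \le j < M\} \cup \{\mathtt{wr}[0]\}}_{\mathtt{tid} = 0}
\quad \text{DRF with?} \quad
\underbrace{\{\mathtt{rd}[1+j] \mid 0 \le j < M\} \cup \{\mathtt{wr}[1]\}}_{\mathtt{tid} = 1}
$$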


When M>1, thread 0 (l.h.s) accesses rd[1] and thread 1 (r.h.s) accesses wr[1]. 
Thus, there is a data-race on index 1 of the array. 
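
The two access sets above are small enough to enumerate directly. The following Python sketch is an illustration only and is not how Faial works; the tuple encoding of accesses and the function names `accesses` and `racy_indices` are assumptions introduced here.

```python
def accesses(tid, M):
    """Accesses of thread `tid` in the protocol of Listing 2.2, as (mode, index) pairs."""
    reads = {("rd", tid + j) for j in range(M)}
    # The i-loop writes the same cell in each of its M iterations.
    writes = {("wr", tid)} if M > 0 else set()
    return reads | writes


def racy_indices(acc_a, acc_b):
    """Indices touched by both threads where at least one access is a write."""
    return sorted({ia for (ma, ia) in acc_a for (mb, ib) in acc_b
                   if ia == ib and "wr" in (ma, mb)})


print(racy_indices(accesses(0, 2), accesses(1, 2)))
print(racy_indices(accesses(0, 1), accesses(1, 1)))
```

With M = 2 the only conflicting index is 1, matching the discussion above; with M = 1 the two threads touch disjoint cells.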

A fundamental challenge of static DRF verification is how to handle loops. 
Symbolic execution approaches that unroll loops, e.g., [26,27], can handle
neither large nor symbolic iteration spaces. Static approaches that use Hoare logic, e.g.,
[5,7,22], require user-provided loop invariants. Another approach is to reduce 
loops to verifying the satisfiability of a corresponding universally quantified 
formula, e.g., [25,30]. This has the advantage of being fast and not requiring 
invariants. However, its previous application to GPU programming, i.e., PUG, 
is unsound due to the interaction between barrier synchronizations and loops, 
e.g., PUG misses the data-race in Listing 2.1. We give more details in Sect. 6. 


Our Approach. A key contribution of our work is to identify conditions that allow 
a kernel to be reduced to a first-order logic formula, by precisely characterizing 
the effect of barrier synchronization in loops. To this end, the language of access 
memory protocols distinguishes syntactically between protocol fragments that 
synchronize from those that do not. For instance, the protocol in Listing 2.2 is 
identified as unsynchronized, as it does not include any synchronization. 

In Sect.4, we show that the DRF analysis of unsynchronized protocols can 
be precisely reduced to a first-order logic formula, where universally quantified 
formulae represent loops, thus obviating the need to unroll them explicitly. For 
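instance, we reduce the verification of Listing 2.2 to asking whether for all $M$,
$t_1$, and $t_2$, where $t_1 \neq t_2$ are thread identifiers, the following holds:


$$
\forall j_1, i_1, j_2, i_2 \colon\; 0 \le j_1 < M \land 0 \le i_1 < M \land 0 \le j_2 < M \land 0 \le i_2 < M \implies
\{\mathtt{rd}[t_1 + j_1]\} \cup \{\mathtt{wr}[t_1]\} \;\; \text{DRF with?} \;\; \{\mathtt{rd}[t_2 + j_2]\} \cup \{\mathtt{wr}[t_2]\}
$$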


This formula is unprovable since $\mathtt{rd}[t_1 + j_1]$ races with $\mathtt{wr}[t_2]$ when, e.g., $t_1 = 0$,
$t_2 = 1$, $j_1 = 1$, and $M > 1$. Hence, our technique flags Listing 2.2 as racy.
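
A counterexample to such a formula can be found by bounded enumeration when all parameters are small. The Python sketch below stands in for the SMT query and is not part of Faial; the bounds `max_threads` and `max_m` and the function name `find_race` are assumptions. Unlike an SMT solver, the search can only refute the formula within the chosen bounds and can never establish it.

```python
from itertools import product


def find_race(max_threads, max_m):
    """Search for M, t1 != t2, j1 such that rd[t1 + j1] and wr[t2] hit the same index."""
    for M in range(max_m + 1):
        for t1, t2 in product(range(max_threads), repeat=2):
            if t1 == t2:
                continue
            # wr[t1] and wr[t2] never collide since t1 != t2, and two reads never race,
            # so only a read of one thread against the write of the other matters.
            for j1 in range(M):
                if t1 + j1 == t2:
                    return {"M": M, "t1": t1, "t2": t2, "j1": j1}
    return None


print(find_race(max_threads=4, max_m=4))
```

The first witness found has t1 = 0, t2 = 1, j1 = 1 and M = 2, an instance of the assignment given in the text.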


Analysis of Synchronized Protocols. The protocol in Listing 2.3 (left) mod- 
els all the interactions over the shared array tile from Listing 2.1. This protocol 
consists of one outer loop r that contains two inner loops separated by a barrier 
synchronization (sync). The first inner loop writes (wr[tid]) to the array, while 
the second reads (rd[tid + j]) from the array. 
Listing 2.3 Access memory protocol (left) of array tile from Listing 2.1 and its
aligned version (right).


(left)
1 for^S r in 0..N {
2   for^U i in 0..M { wr[tid] }
3   sync;
4   for^U j in 0..M { rd[tid + j] }
5 }

aligns to

(right)
1 for^U i in 0..M { wr[tid] }
2 sync;
3 for^S r in 1..N {
4   for^U j in 0..M { rd[tid + j] }
5   for^U i in 0..M { wr[tid] }
6   sync; }
7 for^U j in 0..M { rd[tid + j] }


This protocol illustrates how our language syntactically differentiates between
protocol fragments that synchronize and those that do not. Concretely, our language
precludes an unsynchronized loop ($\mathtt{for}^{U}\ x \in n..m\ \{u\}$) from calling sync anywhere
in $u$, and it requires that a synchronized loop ($\mathtt{for}^{S}\ x \in n..m\ \{p\}$) includes
at least one occurrence of sync. The superscript U (resp. S) stands for unsynchronized
(resp. synchronized). This distinction can be inferred automatically and yields
a compositional analysis, as we explain below.

The behavior of synchronized loops is difficult to analyse because they may 
contain data-races that span more than one iteration. For instance an instruction 
of iteration r in Listing 2.3 may race with an instruction of iteration r+1. 


Our Approach. In this work we show that the DRF analysis of synchronized proto- 
cols can safely be reduced to a first-order logic formula when such loops are aligned, 
i.e., when there is at least one synchronization exactly before the loop and one 
at the end of its body. In Sect. 4.1 we show how to transform an arbitrary access 
memory protocol into an aligned protocol using a syntax-driven transformation 
technique called barrier aligning. Intuitively, barrier aligning normalizes loops so 
that they do not “leak” accesses between iterations. The right-hand side of List- 
ing 2.3 shows the result of applying barrier aligning on the protocol from Listing 2.3 
(left). Observe that the fragment before the aligned loop (line 1) corresponds to the 
unsynchronized part of the original loop (before sync). The original loop itself is 
rearranged so that the part succeeding sync is moved to the beginning of the aligned 
loop (lines 3-6). The fragment following the aligned loop (line 7) corresponds to 
the unsynchronized loop that appears after the sync in the original protocol. 
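
The rearrangement can be checked on this example by comparing the phases (the sets of accesses between consecutive barriers) that one thread produces in the original and in the aligned protocol. The Python sketch below hard-codes the shape of Listing 2.3 and is not the general syntax-driven transformation of Sect. 4.1. The trace encoding, the function names, the assumption N ≥ 1, and the index shift `r - 1` for the moved fragment are all assumptions of this sketch; Listing 2.3 does not exercise the shift because its loop bodies ignore r.

```python
def split_phases(trace):
    """Split a per-thread trace at each 'sync' into a list of access sets."""
    phases, current = [], set()
    for item in trace:
        if item == "sync":
            phases.append(current)
            current = set()
        else:
            current.add(item)
    phases.append(current)
    return phases


def write_loop(tid, M, r):
    return [("wr", tid) for _ in range(M)]          # for^U i in 0..M { wr[tid] }


def read_loop(tid, M, r):
    return [("rd", tid + j) for j in range(M)]      # for^U j in 0..M { rd[tid + j] }


def original(tid, N, M):
    trace = []
    for r in range(N):
        trace += write_loop(tid, M, r) + ["sync"] + read_loop(tid, M, r)
    return trace


def aligned(tid, N, M):
    assert N >= 1, "sketch assumes the outer loop runs at least once"
    trace = write_loop(tid, M, 0) + ["sync"]        # fragment before the aligned loop
    for r in range(1, N):                           # aligned loop body ends with sync
        trace += read_loop(tid, M, r - 1) + write_loop(tid, M, r) + ["sync"]
    trace += read_loop(tid, M, N - 1)               # fragment after the aligned loop
    return trace


print(split_phases(original(0, 2, 2)) == split_phases(aligned(0, 2, 2)))
```

For N = 2 and M = 2 both versions yield three phases for thread 0: the writes alone, then reads and writes together, then the reads alone. The middle phase is the one in which accesses of consecutive iterations meet.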

In Sect.4.1 we show that aligned protocols enable compositional verifica- 
tion: protocol fragments between two barriers can be analyzed independently. 
This compositional analysis is possible because (i) there is no causality between 
instructions, except through sync and (ii) aligned protocols syntactically delimit 
the causality induced by sync. For instance, the aligned protocol in Listing 2.3 can 
be reduced to analyzing the following three protocol fragments independently: 


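$$
\mathtt{for}^{U}\ i \in 0..M\ \{\mathtt{wr}[\mathtt{tid}]\} \qquad \mathtt{for}^{U}\ j \in 0..M\ \{\mathtt{rd}[\mathtt{tid} + j]\}
$$
$$
\mathtt{for}^{S}\ r \in 1..N\ \{\mathtt{for}^{U}\ j \in 0..M\ \{\mathtt{rd}[\mathtt{tid} + j]\};\ \mathtt{for}^{U}\ i \in 0..M\ \{\mathtt{wr}[\mathtt{tid}]\};\ \mathtt{sync}\}
$$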


The first two protocols are handled like Listing 2.2 because they are unsynchro- 
nized. Representing a synchronized loop as a formula becomes possible when 
the protocol is aligned: both threads must share the same value for r at each
iteration. Hence, we reduce the verification to asking whether for all $N$, $M$, $t_1$,
and $t_2$, where $t_1 \neq t_2$, the following holds:


$$
\forall r, j_1, i_1, j_2, i_2 \colon\; 1 \le r < N \land 0 \le j_1 < M \land 0 \le i_1 < M \land 0 \le j_2 < M \land 0 \le i_2 < M
$$
$$
\implies \{\mathtt{rd}[t_1 + j_1]\} \cup \{\mathtt{wr}[t_1]\} \;\; \text{DRF with?} \;\; \{\mathtt{rd}[t_2 + j_2]\} \cup \{\mathtt{wr}[t_2]\}
$$


Our technique identifies Listing 2.3 as racy since this formula is unprovable, i.e.,
$\mathtt{rd}[t_1 + j_1]$ races with $\mathtt{wr}[t_2]$ when $r = 1$, $t_1 = 0$, $t_2 = 1$, $j_1 = 1$, $N > 1$, and $M > 1$.


3 Access Memory Protocols 


An access memory protocol describes the interaction between a group of threads 
and a single shared-memory location. A protocol records where in memory 
accesses take place, but abstracts away from what data is read from/written 
to memory. The language of protocols distinguishes between an unsynchronized
protocol fragment $u \in \mathcal{U}$, which disallows synchronization, and a synchronized
fragment $p \in \mathcal{S}$, which must include a synchronization. The syntax and semantics
of access memory protocols are given in Figure 2. Our operational semantics
is inspired by the synchronous, delayed semantics (SDV) from Betts et al. [6],
where threads execute independently and communicate upon reaching a barrier.
Hereafter, $i, j, k$ are metavariables over non-negative integers picked from the
set $\mathbb{N}$. An arithmetic expression $n$ is either an integer variable $x$, an integer $i$,
or a binary operation on integers that yields an integer. A boolean expression $b$
is either a boolean literal, an arithmetic comparison $\diamond$, or a propositional logic
connective $\circ$. We write $n \downarrow i$ when expression $n$ evaluates to integer $i$, where
evaluation is defined in the natural way. We overload the notation for Boolean
expressions, e.g., $b \downarrow \mathtt{true}$ means that expression $b$ evaluates to true.


Unsynchronized Fragment. A protocol $u \in \mathcal{U}$ either does nothing (skip),
accesses a shared memory location $o[i]$ (reads from/writes to index $i$), performs
sequential composition, or loops. Figure 2 gives the semantics of unsynchronized
protocols, which is parameterized by a set of thread identifiers
$\mathcal{T} \subseteq \mathbb{N}$, where $|\mathcal{T}| \ge 2$.

Evaluation of an unsynchronized protocol $u$ by a thread identifier $i$, written
$u \downarrow_i P$, yields a phase, i.e., a set $P \in \mathbb{P}$ of access values
$\alpha \in \mathbb{A}$. Each access value, or just access, notation $i{:}o[j]$,
consists of its issuing thread identifier $i$, an access mode $o$ (read/write), and
an index $j$. Protocol skip produces no accesses. A memory access $o[n]$ evaluates
the index and creates a singleton phase. Sequencing and looping are standard. Loop
ranges include the lower bound and exclude the upper bound. Similarly to SDV, Rule
U-par executes a copy of the unsynchronized code $u$ for each thread
$i \in \mathcal{T}$ by replacing the special variable tid by the thread identifier,
$u[\mathtt{tid} := i]$, which results in the union of the accesses of all threads.
To simplify the presentation we omit unsynchronized conditionals; however, they are
included in our Coq formalism and are fully supported by Faial, see Sect. 5.
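
The rules for the unsynchronized fragment translate almost directly into an
interpreter. The following is a minimal sketch, not Faial's implementation: the
tuple encoding of protocols and the names `eval_unsync` and `eval_par` are
assumptions made for illustration, and index and bound expressions are modeled as
Python functions over an environment.

```python
def eval_unsync(u, tid, env):
    """Phase produced by thread `tid` when running protocol `u`.

    Protocols are tuples: ("skip",), ("acc", mode, index), ("seq", u1, u2),
    and ("for", var, lo, hi, body); index, lo and hi map an environment
    (a dict of variable values) to an integer.
    """
    env = {**env, "tid": tid}  # plays the role of u[tid := i]
    kind = u[0]
    if kind == "skip":
        return set()
    if kind == "acc":
        _, mode, index = u
        return {(tid, mode, index(env))}  # singleton phase
    if kind == "seq":
        return eval_unsync(u[1], tid, env) | eval_unsync(u[2], tid, env)
    if kind == "for":
        _, var, lo, hi, body = u
        phase = set()
        # range() includes the lower bound and excludes the upper bound.
        for value in range(lo(env), hi(env)):
            phase |= eval_unsync(body, tid, {**env, var: value})
        return phase
    raise ValueError(f"unknown protocol: {kind}")


def eval_par(u, threads, env=None):
    """Rule U-par: union of the phases of every thread in `threads`."""
    phase = set()
    for tid in threads:
        phase |= eval_unsync(u, tid, env or {})
    return phase


# for^U j in 0..2 {rd[tid + j]}; wr[tid]
example = (
    "seq",
    ("for", "j", lambda e: 0, lambda e: 2,
     ("acc", "rd", lambda e: e["tid"] + e["j"])),
    ("acc", "wr", lambda e: e["tid"]),
)
print(sorted(eval_par(example, threads=[0, 1])))
```

Each access is a triple (thread, mode, index), so the printed phase for threads 0
and 1 contains two reads and one write per thread.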
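Syntax

$$
\begin{array}{ll}
\mathbb{N} \ni i ::= 0 \mid 1 \mid \cdots & o ::= \mathtt{wr} \mid \mathtt{rd} \\
n ::= x \mid i \mid n \star n & \mathbb{A} \ni \alpha ::= i{:}o[j] \\
\mathbb{B} \ni b ::= \mathtt{true} \mid \mathtt{false} \mid n \diamond n \mid b \circ b & \mathbb{P} \ni P ::= \{\alpha_1, \ldots, \alpha_n\} \\
\mathcal{U} \ni u ::= \mathtt{skip} \mid o[n] \mid u ; u \mid \mathtt{for}^U\, x \in n..m\ \{u\} & \mathbb{H} \ni H ::= [\,] \mid P :: H \\
\mathcal{S} \ni p ::= \mathtt{sync} \mid u \mid p ; p \mid \mathtt{for}^S\, x \in n..m\ \{p\} &
\end{array}
$$

Big-step semantics for $\mathcal{U}$: $u \downarrow_i P$ and $u \downarrow_{\mathcal{T}} P$

$$
\text{(U-skip)}\ \frac{}{\mathtt{skip} \downarrow_i \emptyset}
\qquad
\text{(U-acc)}\ \frac{n \downarrow j}{o[n] \downarrow_i \{i{:}o[j]\}}
\qquad
\text{(U-seq)}\ \frac{u_1 \downarrow_i P_1 \quad u_2 \downarrow_i P_2}{u_1 ; u_2 \downarrow_i P_1 \cup P_2}
$$

$$
\text{(U-for-1)}\ \frac{(n \ge m) \downarrow \mathtt{true}}{\mathtt{for}^U\, x \in n..m\ \{u\} \downarrow_i \emptyset}
\qquad
\text{(U-for-2)}\ \frac{(n < m) \downarrow \mathtt{true} \quad u[x := n] \downarrow_i P_1 \quad \mathtt{for}^U\, x \in n+1..m\ \{u\} \downarrow_i P_2}{\mathtt{for}^U\, x \in n..m\ \{u\} \downarrow_i P_1 \cup P_2}
$$

$$
\text{(U-par)}\ \frac{P = \bigcup \{P_i \mid i \in \mathcal{T} \land u[\mathtt{tid} := i] \downarrow_i P_i\}}{u \downarrow_{\mathcal{T}} P}
$$

History concatenation ($H \cdot H$) and merging ($H \odot H$)

$$
[P_1, \ldots, P_n] \cdot [P_{n+1}, \ldots, P_{n+m}] = [P_1, \ldots, P_{n+m}]
\qquad
(H \cdot [P]) \odot ([P'] \cdot H') = H \cdot [P \cup P'] \cdot H'
$$

Big-step semantics for $\mathcal{S}$: $p \Downarrow H$

$$
\text{(S-sync)}\ \frac{}{\mathtt{sync} \Downarrow [\emptyset, \emptyset]}
\qquad
\text{(S-par)}\ \frac{u \downarrow_{\mathcal{T}} P}{u \Downarrow [P]}
\qquad
\text{(S-seq)}\ \frac{p \Downarrow H \quad q \Downarrow H'}{p ; q \Downarrow H \odot H'}
\qquad
\text{(S-for-1)}\ \frac{(n \ge m) \downarrow \mathtt{true}}{\mathtt{for}^S\, x \in n..m\ \{p\} \Downarrow [\emptyset]}
$$

$$
\text{(S-for-2)}\ \frac{(n < m) \downarrow \mathtt{true} \quad p[x := n] \Downarrow H \quad \mathtt{for}^S\, x \in n+1..m\ \{p\} \Downarrow H'}{\mathtt{for}^S\, x \in n..m\ \{p\} \Downarrow H \odot H'}
$$

Structurally well-formed protocols: $\mathrm{swf}(p)$

$$
\frac{}{\mathrm{swf}(u ; \mathtt{sync})}
\qquad
\frac{\mathrm{swf}(p) \quad \mathrm{swf}(q)}{\mathrm{swf}(p ; q)}
\qquad
\frac{\mathrm{swf}(p) \quad \mathtt{tid} \notin \mathrm{fv}(n) \cup \mathrm{fv}(m)}{\mathrm{swf}(u_1 ; \mathtt{for}^S\, x \in n..m\ \{p ; u_2\})}
$$

Data-race, safe phase, and safe history: $\alpha \mathrel{\#} \beta$, $\mathrm{safe}(P)$, $\mathrm{safe}(H)$

$$
\frac{\mathtt{wr} \in \{o, o'\} \quad i \neq j}{i{:}o[k] \mathrel{\#} j{:}o'[k]}
\qquad
\frac{\forall \alpha, \beta \in P \colon \neg(\alpha \mathrel{\#} \beta)}{\mathrm{safe}(P)}
\qquad
\frac{\forall P \in H \colon \mathrm{safe}(P)}{\mathrm{safe}(H)}
$$

Fig. 2. Syntax, semantics, and properties of access memory protocols.
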
Synchronized Fragment. A protocol $p \in \mathcal{S}$ may perform barrier
synchronization sync, run unsynchronized code $u$, perform sequential composition,
and loop. Figure 2 gives the semantics of a protocol, notation $p \Downarrow H$.
Evaluation of a protocol $p$ yields a history (ranged over by $H$), i.e., a list of
phases ($P$) that records how memory was accessed. We use $::$ as list constructor
and $\cdot$ for the usual list concatenation operator. Histories are merged using
the special $\odot$-operator.

A barrier synchronization creates two empty phases, corresponding to phases 
before and after synchronization. Running an unsynchronized protocol yields a 
single phase containing all accesses performed by that protocol. Sequencing two 
synchronized protocols p with q merges the last phase of the former with the 
first phase of the latter, as these two phases run concurrently. The base case of 
a synchronized loop produces a singleton history containing the empty phase. 
Running one iteration of a synchronized loop sequences the history of the first 
iteration with the rest of the loop, by merging the two histories. 
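
History merging and the rules for the synchronized fragment can be sketched in the
same style. This is an illustrative interpreter under simplifying assumptions that
are not part of the formalism: unsynchronized code is modeled as a Python function
from an environment to an already-evaluated phase, loop bounds are integer
constants, and `merge` and `eval_sync` are hypothetical names.

```python
def merge(h1, h2):
    """Merge two non-empty histories: the last phase of h1 and the first
    phase of h2 run concurrently, so they are joined into one phase."""
    return h1[:-1] + [h1[-1] | h2[0]] + h2[1:]


def eval_sync(p, env):
    """History (list of phases) of a synchronized protocol.

    Protocols are tuples: ("sync",), ("unsync", phase_of), ("seq", p, q),
    and ("for", var, lo, hi, body) with integer bounds lo and hi.
    """
    kind = p[0]
    if kind == "sync":
        return [set(), set()]  # phases before and after the barrier
    if kind == "unsync":
        return [set(p[1](env))]  # a single phase with all accesses
    if kind == "seq":
        return merge(eval_sync(p[1], env), eval_sync(p[2], env))
    if kind == "for":
        _, var, lo, hi, body = p
        if lo >= hi:
            return [set()]  # base case: one empty phase
        first = eval_sync(body, {**env, var: lo})
        rest = eval_sync(("for", var, lo + 1, hi, body), env)
        return merge(first, rest)
    raise ValueError(f"unknown protocol: {kind}")


# for^S x in 0..2 { a(x); sync }, where a(x) stands for some access
loop = ("for", "x", 0, 2,
        ("seq", ("unsync", lambda e: {("a", e["x"])}), ("sync",)))
print(eval_sync(loop, {}))
```

Each barrier closes the phase of its iteration, so the two iterations end up in
distinct phases and a trailing empty phase remains after the last barrier.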

Next, we introduce the notion of well-formed protocols, a restriction of
structurally well-formed protocols, see $\mathrm{swf}(p)$ in Figure 2. We discuss
how well-formedness is enforced in Sect. 5. We write $\mathrm{fv}(p)$ (resp.
$\mathrm{fv}(n)$) for the free variables of $p$ (resp. $n$).


Definition 1 (Well-formed protocol, $p \in \mathcal{W}$). We say that a protocol is
well-formed, notation $p \in \mathcal{W}$, when $\mathrm{swf}(p)$,
$\mathrm{fv}(p) \subseteq \{\mathtt{tid}\}$, and every synchronized loop executes at
least one iteration.


DRF is formalized at the bottom of Figure 2. Two accesses are in a data-race
(or racy) when they are issued by two different threads, access the same index $k$,
and at least one of them is a write. Our definition does not distinguish between
harmful data-races and benign ones, i.e., data-races in which both threads write the
same value. Phase $P$ is safe iff each pair of accesses it contains is not racy.
History $H$ is safe when all of its phases are safe. We say that $p$ is DRF iff
$p \Downarrow H$ and $\mathrm{safe}(H)$.
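
The three properties at the bottom of Figure 2 are short enough to state as
executable predicates. The sketch below assumes accesses are encoded as
(thread, mode, index) triples; the function names are illustrative and not taken
from the tool.

```python
from itertools import combinations


def racy(a, b):
    """Two accesses race when distinct threads touch the same index and
    at least one of the two accesses is a write."""
    (i, o1, k1), (j, o2, k2) = a, b
    return i != j and k1 == k2 and "wr" in (o1, o2)


def safe_phase(phase):
    """A phase is safe when no pair of its accesses is racy."""
    return not any(racy(a, b) for a, b in combinations(phase, 2))


def safe_history(history):
    """A history is safe when all of its phases are safe."""
    return all(safe_phase(phase) for phase in history)


history = [
    {(0, "wr", 0), (1, "wr", 1)},  # writes to distinct indices
    {(0, "rd", 1), (1, "rd", 1)},  # concurrent reads of the same index
    {(0, "rd", 1), (1, "wr", 1)},  # read/write conflict on index 1
]
print([safe_phase(p) for p in history], safe_history(history))
```

Only the last phase is unsafe, so the history as a whole is not safe.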


4 DRF-Preserving Transformations of Protocols


This section presents the main steps of the DRF analysis summarized in Figure 1: 
barrier aligning (Sect. 4.1) and splitting (Sect. 4.2). 

This section also includes our key theoretical results. We establish that these
steps preserve and reflect data-races (i.e., any and all data-races are found), see
Theorem 1 and Theorem 3. We make precise the notion of compositionality that
makes our approach scalable in Theorem 2.


4.1 Aligning Protocols 


The first transformation step normalizes protocols by aligning synchronized
loops, which in turn enables a form of compositional verification. The goal of the
transformation is to produce protocols which belong to $\mathcal{A}$, see top of
Figure 3. Barrier aligning (or just aligning) is performed by function align, given
in the bottom half of Figure 3. The function returns a pair whose first element is
an aligned and synchronized protocol, and whose second element is an unsynchronized
protocol. Intuitively, the pair represents a sequence which we delimit
syntactically. We note that the output of align, say $(q, u)$, can be trivially made
into an aligned protocol: $q ; u ; \mathtt{sync}$. The case for synchronization is
simple: align returns the input protocol as the first component of the pair and skip
as the second component (the input protocol is already fully aligned). The case for
sequence consists of the sequential composition of the pair aligned with
unsynchronized code using operator $\triangleleft$. Sequencing two pairs
$(p, u) \triangleleft (q, u')$ amounts to sequencing $u$ to the outer-most piece of
unsynchronized code present in $q$.

Aligned protocols: $p \in \mathcal{A}$

$$
\frac{}{u ; \mathtt{sync} \in \mathcal{A}}
\qquad
\frac{p \in \mathcal{A} \quad q \in \mathcal{A}}{p ; q \in \mathcal{A}}
\qquad
\frac{p \in \mathcal{A} \quad q \in \mathcal{A}}{p ; \mathtt{for}^S\, x \in n..m\ \{q\} \in \mathcal{A}}
$$

Sequencing aligned protocols: $\triangleleft : \mathcal{U} \to \mathcal{A} \to \mathcal{A}$ and $\triangleleft : (\mathcal{A} \times \mathcal{U}) \to (\mathcal{A} \times \mathcal{U}) \to \mathcal{A} \times \mathcal{U}$

$$
u \triangleleft (u' ; \mathtt{sync}) = (u ; u') ; \mathtt{sync}
\qquad
u \triangleleft (p ; q) = (u \triangleleft p) ; q
\qquad
(p, u) \triangleleft (q, u') = (p ; (u \triangleleft q),\ u')
$$

Aligning protocols: $\mathrm{align} : \mathcal{W} \to \mathcal{A} \times \mathcal{U}$

$$
\mathrm{align}(u ; \mathtt{sync}) = (u ; \mathtt{sync},\ \mathtt{skip})
\qquad
\mathrm{align}(p ; q) = \mathrm{align}(p) \triangleleft \mathrm{align}(q)
$$

$$
\frac{\mathrm{align}(p) = (q, u_3) \quad q_1 = u_1 \triangleleft q[x := n] \quad u = u_3 ; u_2}{\mathrm{align}(u_1 ; \mathtt{for}^S\, x \in n..m\ \{p ; u_2\}) = (q_1 ; \mathtt{for}^S\, x \in n+1..m\ \{u[x := x-1] \triangleleft q\},\ u[x := m-1])}
$$

Fig. 3. Aligning protocols.

Dealing with synchronized loops is more involved. Given a loop
$u_1 ; \mathtt{for}^S\, x \in n..m\ \{p ; u_2\}$, we produce a protocol consisting of
the fragment preceding the loop and the synchronized part of its first iteration
($q_1$), an aligned loop starting at $n+1$, and the unsynchronized part of its last
iteration ($u[x := m-1]$). See Listing 2.3 for an example of protocol aligning. We
note that we can always unroll the loop because the analysis only considers
non-empty synchronized loops; we discuss how to enforce this assumption in Sect. 5.
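
To make the loop case concrete, the sketch below implements align on a tiny tuple
encoding and compares the history of a protocol with the history of its aligned
counterpart q; u. It is an illustration under assumptions that the formalism does
not make: unsynchronized code is a Python function from an environment to a set of
accesses, loop bounds are integer constants, variables are never shadowed, and all
names (`attach`, `subst`, `history`, ...) are hypothetical.

```python
def skip(env):
    return set()


def useq(u1, u2):
    """Unsynchronized sequencing u1 ; u2 (union of the accesses)."""
    return lambda env: u1(env) | u2(env)


def subst_u(u, x, e):
    """u[x := e], where e maps an environment to an integer."""
    return lambda env: u({**env, x: e(env)})


def subst(q, x, e):
    """q[x := e] on aligned protocols."""
    if q[0] == "usync":
        return ("usync", subst_u(q[1], x, e))
    if q[0] == "seq":
        return ("seq", subst(q[1], x, e), subst(q[2], x, e))
    _, y, n, m, body = q  # ("for", y, n, m, body)
    return ("for", y, n, m, subst(body, x, e))


def attach(u, q):
    """Prepend u to the first unsynchronized piece of aligned protocol q."""
    if q[0] == "usync":
        return ("usync", useq(u, q[1]))
    if q[0] == "seq":
        return ("seq", attach(u, q[1]), q[2])
    raise ValueError("an aligned protocol never starts with a loop")


def align(p):
    """Well-formed p: ("usync", u), ("seq", p, q), ("loop", u1, x, n, m, p, u2)."""
    if p[0] == "usync":
        return p, skip
    if p[0] == "seq":
        (q1, u1), (q2, u2) = align(p[1]), align(p[2])
        return ("seq", q1, attach(u1, q2)), u2
    _, u1, x, n, m, body, u2 = p
    q, u3 = align(body)
    u = useq(u3, u2)
    first = attach(u1, subst(q, x, lambda env: n))  # first iteration, unrolled
    rest = ("for", x, n + 1, m, attach(subst_u(u, x, lambda env: env[x] - 1), q))
    return ("seq", first, rest), subst_u(u, x, lambda env: m - 1)


def merge(h1, h2):
    return h1[:-1] + [h1[-1] | h2[0]] + h2[1:]


def history(p, env):
    """History of either a well-formed or an aligned protocol."""
    kind = p[0]
    if kind == "usync":
        return [p[1](env), set()]
    if kind == "seq":
        return merge(history(p[1], env), history(p[2], env))
    if kind == "for":
        _, x, n, m, body = p
        h = [set()]
        for v in range(n, m):
            h = merge(h, history(body, {**env, x: v}))
        return h
    _, u1, x, n, m, body, u2 = p  # "loop": u1 ; for x in n..m {body ; u2}
    h = [u1(env)]
    for v in range(n, m):
        e = {**env, x: v}
        h = merge(merge(h, history(body, e)), [u2(e)])
    return h


# init ; for^S x in 0..3 { A(x) ; sync ; B(x) }
p = ("loop", lambda env: {"init"}, "x", 0, 3,
     ("usync", lambda env: {("A", env["x"])}), lambda env: {("B", env["x"])})
q, u = align(p)
original = history(p, {})
aligned = merge(history(q, {}), [u({})])
print(original == aligned)
```

On this example the two histories coincide phase by phase: B of iteration x - 1
shares a phase with A of iteration x, both before and after aligning.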

We now state two fundamental properties of barrier aligning: preserving
and reflecting DRF (Theorem 1), and enabling compositional verification
(Theorem 2). Theorem 1 states that verifying DRF of a well-formed protocol $p$ is
equivalent to verifying DRF of its aligned counterpart.


Theorem 1 (Correctness of Align). If $p \in \mathcal{W}$ and
$\mathrm{align}(p) = (q, u)$, then $p$ is DRF if and only if $q ; u$ is DRF.


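To state our compositionality result, we introduce a language of contexts:

$$
C ::= [\_] \mid u ; \mathtt{sync} \mid p ; C \mid C ; p \mid C ; \mathtt{for}^S\, x \in n..m\ \{p\} \mid p ; \mathtt{for}^S\, x \in n..m\ \{C\}
$$

The base cases correspond to a hole $[\_]$ or an unsynchronized protocol (followed
by sync). The other cases follow the structure of access memory protocols.


Theorem 2 (Compositionality). Let $C$ be a context such that
$C[\mathtt{skip} ; \mathtt{sync}]$ is DRF. For all $p \in \mathcal{A}$, if $p$ is DRF
and $\mathrm{fv}(p) \subseteq \{\mathtt{tid}\}$, then $C[p] \in \mathcal{A}$ and
$C[p]$ is also DRF.


Compositionality allows Faial to analyze each fragment of an aligned protocol
independently, by splitting the given protocol into multiple symbolic traces.

Syntax

$$
\mathcal{L} \ni h ::= \mathtt{skip} \mid n{:}o[m] \mid h ; h \mid \mathtt{var}\ x\ \mathtt{in}\ n..m ; h
$$

Product of histories: $H \otimes H$

$$
H_1 \otimes H_2 = [P_1 \cup P_2 \mid (P_1, P_2) \in H_1 \times H_2]
$$

Big-step semantics: $h \Downarrow H$

$$
\frac{}{\mathtt{skip} \Downarrow [\emptyset]}
\qquad
\frac{n \downarrow i \quad m \downarrow j}{n{:}o[m] \Downarrow [\{i{:}o[j]\}]}
\qquad
\frac{h_1 \Downarrow H_1 \quad h_2 \Downarrow H_2}{h_1 ; h_2 \Downarrow H_1 \otimes H_2}
\qquad
\frac{(n \ge m) \downarrow \mathtt{true}}{\mathtt{var}\ x\ \mathtt{in}\ n..m ; h \Downarrow [\emptyset]}
$$

$$
\frac{(n < m) \downarrow \mathtt{true} \quad h[x := n] \Downarrow H_1 \quad \mathtt{var}\ x\ \mathtt{in}\ n+1..m ; h \Downarrow H_2}{\mathtt{var}\ x\ \mathtt{in}\ n..m ; h \Downarrow H_1 \cdot H_2}
$$

Projection: $\mathrm{trace} : \mathcal{U} \to \mathcal{L}$

$$
\begin{array}{ll}
\mathrm{trace}(o[n]) = \mathtt{tid}{:}o[n] & \mathrm{trace}(\mathtt{for}^U\, x \in n..m\ \{u\}) = \mathtt{var}\ x\ \mathtt{in}\ n..m ; \mathrm{trace}(u) \\
\mathrm{trace}(u_1 ; u_2) = \mathrm{trace}(u_1) ; \mathrm{trace}(u_2) & \mathrm{trace}(\mathtt{skip}) = \mathtt{skip}
\end{array}
$$

Splitting protocols: $\mathrm{split} : \mathcal{A} \to [\mathcal{L}]$

$$
\mathrm{split}(p ; q) = \mathrm{split}(p) \cdot \mathrm{split}(q)
$$

$$
\frac{t_1, t_2 \text{ fresh} \quad h_1 = \mathrm{trace}(u)[\mathtt{tid} := t_1] \quad h_2 = \mathrm{trace}(u)[\mathtt{tid} := t_2]}{\mathrm{split}(u ; \mathtt{sync}) = [\mathtt{var}\ t_1\ \mathtt{in}\ 1..|\mathcal{T}| ; \mathtt{var}\ t_2\ \mathtt{in}\ 0..t_1 ; h_1 ; h_2]}
$$

$$
\mathrm{split}(p ; \mathtt{for}^S\, x \in n..m\ \{q\}) = \mathrm{split}(p) \cdot [\mathtt{var}\ x\ \mathtt{in}\ n..m ; h \mid h \in \mathrm{split}(q)]
$$

Fig. 4. Syntax and semantics of symbolic traces, and splitting of protocols.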


4.2 Splitting Protocols into Symbolic Traces 


The second verification step, splitting, consists in transforming an aligned
protocol into symbolic traces, i.e., symbolic representations of sets of memory
accesses which occur between two synchronizations.

Symbolic Traces. Intuitively, symbolic traces are a thin abstraction over an SMT
formula. We describe how to translate a symbolic trace to a formula in Sect. 5.

We give the syntax and semantics of symbolic traces in Figure 4. Expression skip
terminates a trace. Expression $n{:}o[m]$ states that thread $n$ accesses index $m$
with mode $o$. Expression $h_1 ; h_2$ composes two symbolic traces using operator
$\otimes$, also given in Figure 4. Expression var x in n..m; h binds variable $x$
in $h$, where variable $x$ is an integer in the range induced from $n$ and $m$. The
semantics of a symbolic trace yields a history with a phase for each possible
variable assignment. Expression skip yields a single empty phase. Expression
$n{:}o[m]$ evaluates to a singleton set that contains the access value that results
from evaluating the thread-identifier expression $n$ and the index expression $m$.
Sequencing traces $h_1 ; h_2$ consists of performing the product of their histories
(operator $\otimes$), which merges every phase of $H_1$ with every phase of $H_2$.
A variable binder behaves like a skip when the range of values is empty. Otherwise,
we fork two histories $H_1$ and $H_2$. We assign the lower bound of the range in
$H_1$, and we recursively evaluate a variable binder where we increment its lower
bound in $H_2$.
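
The semantics of symbolic traces can be sketched as a small evaluator that
enumerates every variable assignment. This is an illustration only: Faial hands
traces to an SMT solver instead of enumerating them, and the tuple encoding and the
helpers `eval_trace`, `seq`, and `access` are assumptions. The example trace is the
one that Example 1 produces, instantiated with three threads.

```python
def eval_trace(h, env):
    """History of a symbolic trace.

    Traces are tuples: ("skip",), ("acc", thread, mode, index),
    ("seq", h1, h2), and ("var", x, lo, hi, body); thread, index, lo and hi
    map an environment to an integer.
    """
    kind = h[0]
    if kind == "skip":
        return [set()]
    if kind == "acc":
        _, thread, mode, index = h
        return [{(thread(env), mode, index(env))}]
    if kind == "seq":
        left, right = eval_trace(h[1], env), eval_trace(h[2], env)
        return [p1 | p2 for p1 in left for p2 in right]  # product of histories
    if kind == "var":
        _, var, lo, hi, body = h
        result = []
        for value in range(lo(env), hi(env)):
            result += eval_trace(body, {**env, var: value})
        return result + [set()]  # an empty range behaves like skip
    raise ValueError(f"unknown trace: {kind}")


def seq(*traces):
    """Right-nested sequencing h1 ; (h2 ; ...)."""
    head, *tail = traces
    return ("seq", head, seq(*tail)) if tail else head


def access(thread_var, mode, offset):
    """Access thread_var:mode[thread_var + offset]."""
    return ("acc", lambda e: e[thread_var], mode, lambda e: e[thread_var] + offset)


# var t1 in 1..3; var t2 in 0..t1; t1:wr[t1+1]; t1:rd[t1+2]; t2:wr[t2+1]; t2:rd[t2+2]
trace = ("var", "t1", lambda e: 1, lambda e: 3,
         ("var", "t2", lambda e: 0, lambda e: e["t1"],
          seq(access("t1", "wr", 1), access("t1", "rd", 2),
              access("t2", "wr", 1), access("t2", "rd", 2))))
phases = [phase for phase in eval_trace(trace, {}) if phase]
print(len(phases))
```

There is one non-empty phase per pair of thread identifiers with t2 < t1; the
remaining phases are the empty ones that close each variable binder.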


Barrier splitting is the transformation from aligned protocols to symbolic traces,
performed via functions trace and split, defined in Figure 4. Function trace
extracts the symbolic trace of an unsynchronized program for a single thread.
Memory accesses are tagged with the owner thread tid, and unsynchronized
loops are converted into variable bindings. Function split returns a list of
symbolic traces. The case for $p ; q$ is trivial (operator $\cdot$ stands for list
concatenation). The base case of split is for unsynchronized protocol fragment $u$,
which produces a list containing a single symbolic trace. It introduces fresh
variables $t_1$ and $t_2$ that represent two (distinct) symbolic thread identifiers.
The rest of the trace consists of the trace of $u$ instantiated to the first thread
identifier $t_1$ followed by its instantiation to the second thread identifier
$t_2$. The case for synchronized loops simply reinterprets the loop as a variable
binder. Function split leads to an exponential blow up wrt. nesting of synchronized
loops, but this has not posed problems in practice, cf. Claim 2.


Example 1. Let p = wr[tid + 1]; rd[tid + 2]; sync. We have that split(p) returns:
var t1 in 1..|T|; var t2 in 0..t1; t1:wr[t1+1]; t1:rd[t1+2]; t2:wr[t2+1]; t2:rd[t2+2]
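
Functions trace and split are syntax-directed, so they can be sketched as a
pretty-printer. In the hypothetical encoding below, expressions are plain strings,
substitution of tid is textual, and the names t1 and t2 are assumed not to occur in
the input protocol; none of this reflects how Faial represents traces internally.

```python
import re

TID = re.compile(r"\btid\b")


def trace(u, thread):
    """Render trace(u)[tid := thread] as text.

    Unsynchronized code: ("skip",), ("acc", mode, index), ("seq", u1, u2),
    ("for", var, lo, hi, body); index, lo and hi are expression strings.
    """
    kind = u[0]
    if kind == "skip":
        return "skip"
    if kind == "acc":
        _, mode, index = u
        return f"{thread}:{mode}[{TID.sub(thread, index)}]"
    if kind == "seq":
        return f"{trace(u[1], thread)}; {trace(u[2], thread)}"
    _, var, lo, hi, body = u  # unsynchronized loops become variable binders
    lo, hi = TID.sub(thread, lo), TID.sub(thread, hi)
    return f"(var {var} in {lo}..{hi}; {trace(body, thread)})"


def split(p):
    """List of symbolic traces of an aligned protocol.

    Aligned protocols: ("usync", u) for u ; sync, ("seq", p, q), and
    ("for", var, lo, hi, q) for a synchronized loop.
    """
    kind = p[0]
    if kind == "usync":
        u = p[1]
        binders = "var t1 in 1..|T|; var t2 in 0..t1"
        return [f"{binders}; {trace(u, 't1')}; {trace(u, 't2')}"]
    if kind == "seq":
        return split(p[1]) + split(p[2])
    _, var, lo, hi, body = p  # synchronized loop: reinterpret as a binder
    return [f"var {var} in {lo}..{hi}; {h}" for h in split(body)]


# Example 1: wr[tid + 1]; rd[tid + 2]; sync
example = ("usync", ("seq", ("acc", "wr", "tid + 1"), ("acc", "rd", "tid + 2")))
for h in split(example):
    print(h)
```

Running the sketch on the protocol of Example 1 yields a single trace with the same
shape as the one shown in the example; a protocol with a synchronized loop yields
one additional trace per `u ; sync` fragment inside the loop body.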


We show that barrier splitting preserves and reflects DRF. 


Theorem 3. Let $p \in \mathcal{A}$ such that $p \Downarrow H_1$, and let
$H_2 = [H \mid h \in \mathrm{split}(p) \land h \Downarrow H]$; then
$\mathrm{safe}(H_1)$ if and only if $\mathrm{safe}(H_2)$.


Hence we have established that aligning (Theorem 1) and splitting (Theo- 
rem 3) preserve and reflect data-races, i.e., any and all data-races are found. 
Thus, the only source of approximation in our analysis stems from the inference 
of protocols from CUDA kernels, which we discuss in the next section. Theorem 3 
highlights the compositionality of our analysis, as each symbolic trace resulting 
from function split can be analyzed independently. 
5 Implementation


In this section we present our tool, Faial, that implements the steps described 
in Figure 1. Faial takes a CUDA kernel as input and produces results that either 
identify the kernel as DRF or list specific data-races. In this section, we describe 
the implementation of the protocol inference, well-formedness checks, and trans- 
formation to SMT. 


Inference. This step transforms a CUDA kernel into access memory protocols 
(one for each shared array). It uses libclang [23] to parse the kernel, a standard
static single assignment (SSA) transformation to simplify the analysis of indices
and arrays, and code slicing to only retain code related to shared array accesses. 
We note that Faial supports constructs of the CUDA programming model that 
are not directly modeled by access memory protocols, e.g., unstructured loops, 
conditionals, function calls, and multi-dimensional arrays. To support multi- 
dimensional thread identifiers, we extend the language of protocols to support 
multiple thread identifiers, and adapt function split accordingly. The main chal- 
lenges are related to loops and function calls. 

Whenever possible, loops are transformed to loops with a stride of 1, following
ideas from loop normalization [24] and abstraction [30]. For instance, in
for(int i=lb;i<ub;i+=s){S} we change the stride from s into 1 by executing the loop
body S only when the offset of the loop variable i from lb is divisible by the
stride, i.e., the loop becomes for(int i=lb;i<ub;i++) if((i-lb)%s==0){S}. Similarly,
a loop ranging over powers of s, e.g., for(int i=lb;i<ub;i*=s){S}, becomes
for(int i=lb;i<ub;i++) if(powerof(i,s)){S}, where function powerof(i,s) tests
whether i is a power of base s. We approximate while loops as structured loops with
an unknown upper bound.
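
The effect of the stride normalization can be checked on concrete bounds. The sketch
below compares the values of i for which the body runs before and after the rewrite,
with the divisibility guard written as (i - lb) % s == 0 so that both loops visit
the same values. It assumes a positive stride for the additive loop, and s >= 2 with
lb itself a power of s for the multiplicative loop; the function names are
illustrative, with `power_of` standing in for the powerof test mentioned above.

```python
def strided(lb, ub, s):
    """Values of i visited by: for (int i = lb; i < ub; i += s)."""
    values, i = [], lb
    while i < ub:
        values.append(i)
        i += s
    return values


def strided_normalized(lb, ub, s):
    """Stride-1 loop that runs the body only when the guard holds."""
    return [i for i in range(lb, ub) if (i - lb) % s == 0]


def power_of(i, s):
    """True when i = s**k for some k >= 0 (assumes s >= 2)."""
    if i < 1:
        return False
    while i % s == 0:
        i //= s
    return i == 1


def geometric(lb, ub, s):
    """Values of i visited by: for (int i = lb; i < ub; i *= s)."""
    values, i = [], lb
    while i < ub:
        values.append(i)
        i *= s
    return values


def geometric_normalized(lb, ub, s):
    """Stride-1 loop guarded by the power-of test."""
    return [i for i in range(lb, ub) if power_of(i, s)]


print(strided(3, 20, 4) == strided_normalized(3, 20, 4))
print(geometric(1, 100, 3) == geometric_normalized(1, 100, 3))
```

Both comparisons hold for these bounds. The normalized loops iterate over the full
range, which is harmless for a symbolic analysis that reasons about the guard rather
than executing the loop.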

Function calls that manipulate shared memory are uncommon in GPU programming.
Additionally, auxiliary functions that manipulate shared memory have a compiler
annotation to inline their bodies, hence we can inline such calls easily. Faial
cannot handle recursive functions, but these rarely occur in practice.
Function calls that do not access shared memory are simply discarded. 


Well-Formedness. This step ensures that kernels Faial analyzes meet the
well-formedness conditions, i.e., $p \in \mathcal{W}$, including the assumption that
synchronized loops iterate at least once, see Definition 1. First, Faial annotates
loops with a synchronized/unsynchronized tag according to the presence of sync in
the loop body, then adjusts the precedence of sequencing to group all unsynchronized
code preceding a sync or a synchronized loop. Synchronized loops of well-formed
protocols cannot manipulate thread-local variables (i.e., tid), an assumption shared
by the CUDA programming model. Hence, Faial flags such kernels as erroneous.
Next, Faial adds assertions before/after synchronized loops to check that the
loop range is non-empty, i.e., loops execute at least once. Similarly to loops,
conditionals are tagged as synchronized or unsynchronized. Then, Faial inlines
synchronized conditionals, i.e., when a synchronized conditional is found, two
copies of the input program are created and each copy is prefixed by a global
assertion corresponding to the condition. Faial does not support synchronized
conditionals that appear within synchronized loops. We have not found real-world
kernels that include such a construction.


Quantification. This step transforms each symbolic trace (Figure 4) into an SMT
formula, to check for safety, cf. Figure 2. The presented formalism assumes a
decidable fragment. However, as CUDA programs may include multiplication in
index expressions, Faial uses an undecidable logic (SMT-LIB's QF_LIA). Essentially,
the generated formula guarantees that the indices of array accesses are
distinct when there is at least one write. We illustrate this straightforward
transformation with Example 2.


Example 2. The formula generated from the trace in Example 1 is given below:

$$
\begin{array}{l}
\forall t_1, t_2 \colon\; 1 \le t_1 < |\mathcal{T}| \land 0 \le t_2 < t_1 \land (m_1 = \mathtt{wr} \lor m_2 = \mathtt{wr}) \implies \\
\quad ((\mathit{idx}_1 = t_1 + 1 \land m_1 = \mathtt{wr}) \lor (\mathit{idx}_1 = t_1 + 2 \land m_1 = \mathtt{rd})) \\
\quad \land\ ((\mathit{idx}_2 = t_2 + 1 \land m_2 = \mathtt{wr}) \lor (\mathit{idx}_2 = t_2 + 2 \land m_2 = \mathtt{rd})) \land \mathit{idx}_1 \neq \mathit{idx}_2
\end{array}
$$

where each symbolic access is translated to a conjunction representing its index
(idx) and access mode (m). Observe that the formula enforces that indices
$\mathit{idx}_1$ and $\mathit{idx}_2$ (executed by distinct threads) are different.
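
For a fixed number of threads, the check that the formula expresses can be emulated
by enumeration, which helps to see what a counterexample looks like. The sketch
below stands in for the SMT query and is not how Faial solves it; `thread_accesses`
and `find_counterexample` are hypothetical names.

```python
def thread_accesses(t):
    """Symbolic accesses of Example 1 for thread t: wr[t + 1]; rd[t + 2]."""
    return [("wr", t + 1), ("rd", t + 2)]


def find_counterexample(num_threads, accesses=thread_accesses):
    """Search for two threads whose accesses hit the same index with at
    least one write; return the witness, or None if the indices always differ."""
    for t1 in range(1, num_threads):  # 1 <= t1 < |T|
        for t2 in range(0, t1):       # 0 <= t2 < t1
            for m1, idx1 in accesses(t1):
                for m2, idx2 in accesses(t2):
                    if "wr" in (m1, m2) and idx1 == idx2:
                        return (t1, t2, (m1, idx1), (m2, idx2))
    return None


print(find_counterexample(2))
print(find_counterexample(8, accesses=lambda t: [("wr", t)]))
```

With two threads the search finds t1 = 1 and t2 = 0: thread 1 writes index 2 while
thread 0 reads index 2, so the protocol of Example 1 is racy. The second call uses a
protocol in which each thread only writes its own index, for which no witness
exists.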


For multi-dimensional arrays, we generate one pair of indices per dimension, and 
check that at least one pair is distinct. 


6 Experimental Evaluation 


We evaluate Faial over several datasets and show how it fares against existing 
approaches. We structure this evaluation in three claims. 


Claim 1: Correctness. We claim that our approach finds more bugs and raises 
fewer false alarms than existing tools. To evaluate this claim, we compare 
Faial against four state-of-the-art kernel verification tools over 10 kernels that 
are known to be tricky to analyze. 

Claim 2: Scalability. We claim that our approach scales better to larger pro- 
grams. To evaluate this claim, we compare Faial against other tools over a set 
of synthetic benchmarks designed to test the limits of each tool, in terms of 
run time and memory usage. 

Claim 3: Real-world usability. We claim that our approach is more usable
than existing static verification tools on real-world CUDA programs. To evaluate
this claim, we use a varied dataset of real-world DRF kernels and measure
the false alarm rate, run time, and memory usage of Faial, GPUVerify, and PUG.


Benchmarking Environment. To make our evaluation reproducible, we developed
a benchmarking framework to automate our experiments over the different tools
and datasets. For Claim 1 and Claim 3, we designed a tool-agnostic file format for
kernel functions and associated metadata (e.g., expected result of DRF analysis,
grid and block dimensions, and include directives). For Claim 2, we created
a tool that generates kernels according to given templates, e.g., see Figure 7.

Table 1. Results for Claim 1. DRF indicates that a (static analysis) tool reported
a test case as DRF. NRR indicates that a (symbolic execution) tool did not report
any data-race. Label x/y indicates that the tool reported y data-races, x of which are
actual races. Label timeout indicates that the tool did not terminate within 90s. A test
passes if the tool returns the expected result and all reported races are valid.

Test                      Expected  Faial  GPUVerify  PUG  GKLEE    SESA
1 transposeDiagonal       Racy      1/1    0/2        DRF  timeout  timeout
                          DRF       DRF    0/1        DRF  timeout  timeout
2 first-iter              Racy      1/1    0/1        1/1  timeout  timeout
                          DRF       DRF    0/1        0/1  timeout  timeout
3 last-iter               Racy      1/1    1/1        0/1  timeout  timeout
                          DRF       DRF    0/1        DRF  timeout  timeout
4 last-iter-first-iter    Racy      1/1    0/1        0/1  timeout  timeout
                          DRF       DRF    0/1        0/1  timeout  timeout
5 read-index              Racy      0/1    1/1        0/1  NRR      NRR
                          DRF       0/1    DRF        0/1  NRR      NRR
Number of tests passed (of 5):      4      1          0    0        0

We evaluate Faial against the following verification tools: GPUVerify [5]
v2018-03-22; PUG [24] v0.2; and GKLEE [26] and SESA [27] v3.0. Experiments for
Claim 1 use an Intel i5-6500 CPU, 7.7 GB RAM, and Fedora 33 OS, while Claim 2 and
Claim 3 use an Intel i7-10510U CPU, 16 GB RAM, and Pop!_OS.


Excluded Tools. We excluded ESBMC-GPU [33] and Simulee [37] from the evalu- 
ation because we were unable to get them to run satisfactorily. Both tools have 
rudimentary support for verifying arbitrary CUDA kernels. ESBMC-GPU did not 
find a single data-race in our benchmarks, while Simulee produced false alarms 
for every DRF-kernel given. 


Claim 1: Correctness 


We have selected a set of tricky kernels to expose false alarms and missed
data-races in Faial, GPUVerify, PUG, GKLEE, and SESA. Our results are reported
in Table 1. The dataset consists of 5 tests, each consisting of two variations
of the same kernel: one racy and one DRF. The racy version of Test 1 (cf.
Listing 2.1) contains an inter-iteration data-race. The DRF version adds a sync
after the second inner loop. Tests 2 to 4 expose various loop-related data-races.
Their protocols are given in Figure 5. In the racy version of Test 2, wr[tid + 1]
conflicts with wr[tid] of the first iteration. Similarly, in the racy version of
Test 3, wr[tid + 1] of the last iteration races with wr[tid]. In the racy version of
Test 4, the last iteration of a nested loop races with the first iteration of the
following loop. Test 5 exposes the abstraction gap between kernel and access memory
protocols (which abstract away array elements), see Figure 6.

Faial passes more tests than any other tool. Failed Test 5 is caused by access
memory protocols abstracting away from what data is being read from/written
to arrays, i.e., array elements. In each case, Faial reports one spurious
data-race (0/1). We report on performance trade-offs wrt. tracking array elements in
Claim 2.

// first-iter
wr[tid + 1];
for^S x in 0..N {
  if (x > 0)                  // DRF version only
  {wr[tid]};
  sync}

// last-iter
for^S x in 0..N {
  sync;
  if (tid < |T|-1)            // DRF version only
  {wr[tid+1]}};
wr[tid + |T|]

// last-iter-first-iter
for^S x in 1..N+1 {
  for^S y in 1..x+1 {
    sync; wr[tid+x+y]}};
for^S z in N*2..N*3 {
  wr[tid+z+1]; sync}          // "+1" in racy version only

Fig. 5. Protocols for Tests 2 to 4, cf. Claim 1, where N is a free thread-global
variable. Yellow shaded code only appears in the DRF version of first-iter and
last-iter. Red shaded code only appears in the racy version of last-iter-first-iter
(Color figure online).

// Racy kernel
A[tid] = tid;
int x = A[tid];
A[x+1] = 0;

// Protocol A
wr[tid];
rd[tid];
wr[x+1]

// DRF kernel
A[tid] = tid;
int x = A[tid];
A[x] = 0;

// Protocol A
wr[tid];
rd[tid];
wr[x]

Fig. 6. Kernels and protocols for Test 5 (read-index), cf. Claim 1; x becomes a free
thread-local variable as protocols do not model array elements.
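
The spurious report on Test 5 can be illustrated by comparing the DRF kernel with
its protocol. In the sketch below, the concrete kernel is modeled by fixing x to the
value read back from A[tid], which is tid under the assumption that no other thread
writes that element; the protocol leaves x unconstrained. All function names are
hypothetical.

```python
from itertools import product


def accesses(t, x):
    """Accesses wr[tid]; rd[tid]; wr[x] for thread t and a given value of x."""
    return [("wr", t), ("rd", t), ("wr", x)]


def conflict(accs1, accs2):
    """True when two access lists share an index with at least one write."""
    return any(i1 == i2 and "wr" in (m1, m2)
               for (m1, i1), (m2, i2) in product(accs1, accs2))


def kernel_is_drf(num_threads):
    """Concrete DRF kernel: x holds the value stored in A[tid], i.e., tid."""
    return not any(conflict(accesses(t1, t1), accesses(t2, t2))
                   for t1, t2 in product(range(num_threads), repeat=2)
                   if t1 != t2)


def protocol_is_drf(num_threads, x_values):
    """Protocol: x is a free thread-local variable, so any value may occur."""
    return not any(conflict(accesses(t1, x1), accesses(t2, x2))
                   for t1, t2 in product(range(num_threads), repeat=2)
                   if t1 != t2
                   for x1, x2 in product(x_values, repeat=2))


print(kernel_is_drf(4), protocol_is_drf(4, range(4)))
```

The kernel is race free, but the protocol admits assignments such as x = 0 in two
different threads, which is the kind of false alarm reported as 0/1 in Table 1.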

GPUVerify passes Test 5 because it tracks array elements, but fails the remaining
4 tests. Some reported false alarms are ill-formed, e.g., on the racy component
of Test 2, the report (0 : wr[tid]; 16 : wr[tid]) has disjoint indices.

PUG obtains the worst score amongst static tools. Notably, the tool misses a
data-race in Test 1, demonstrating its unsoundness, cf. Sect. 2.1.

GKLEE and SESA time out for tests that include loops, as the loop bounds
are unknown. Both tools miss the data-race in Test 5. Symbolic tools may be able
to report data-races when the bound is known, e.g., timeouts start in Test 1 when
the bound is at least 2, and in Test 2 when the bound is at least 23,000.


Claim 2: Scalability 


We evaluate the scalability of our approach with a synthetic dataset that aims 
at demonstrating how different kernel constructs affect run time and memory 
usage of Faial, GKLEE, GPUVerify, PUG, and SESA. Our dataset is divided into 
five categories, one per syntactical construct in the language of access memory 
protocols, as well as conditionals, which are supported by our inference step, 
cf. Sect. 5. Figure 7 shows the protocols of the kernel patterns we generate in
each category: (i) repeated accesses (read then write), (ii) repeated barrier
synchronizations separated by writes, (iii) repeated conditionals, (iv) increasingly
nested unsynchronized loops, and (v) increasingly nested synchronized loops. In
each category, we vary the problem size by repeating a pattern from 1 to 50
times. Note that all kernels generated this way are DRF.

// accesses
rd[tid + n1*|T|];
wr[tid + n1*|T|];
rd[tid + n2*|T|];
wr[tid + n2*|T|];
// ...

// barriers
wr[tid];
sync;
wr[tid];
sync;
// ...

// conditionals
if tid==0
  {wr[tid]};
if tid==1
  {wr[tid]};
// ...

// unsynchronized loops
for^U i1 in 0..N {
  wr[tid];
  for^U i2 in 0..N {
    wr[tid];
    // ...

// synchronized loops
for^S i1 in 0..N {
  wr[tid]; sync;
  for^S i2 in 0..N {
    wr[tid]; sync;
    // ...

Fig. 7. Synthetic protocols generated for Claim 2. N is a free thread-global variable,
and n1, n2, ... are positive integer literals.


Figure 8 shows the average run time and memory usage over five runs on
logarithmic and linear scales, respectively. For each run, we set a timeout of 90s
and we exclude any run that times out or reports a false alarm. Cutoffs in the
memory plots are determined by the cutoffs in the run time plots.

Overall, Faial is the most scalable tool. In 4 out of 5 categories, Faial has the
slowest growth for all experiments, and verifies all tests within 0.46s. In the
largest problem sizes, our tool is the fastest in 3 categories (access, conditional,
unsynchronized loop), 2nd for barriers, and 3rd for synchronized loops. Overall,
the memory usage of Faial is competitive with other tools. Faial is the only tool
with a near constant time/memory for up to 50 unsynchronized loops, indicating
the scalability of reducing unsynchronized loops to universally quantified
formulas. Faial only times out for kernels that consist of more than 17 nested
synchronized loops. However, such kernels are uncommon, e.g., the levels of nested
synchronized loops in the real-world kernels studied in Claim 3 are at most 3.

GPUVerify remains stable in the barrier and conditional categories but is
affected negatively by loops and accesses. Loops are a known bottleneck in
GPUVerify [2]. In the access category there is an exponential slowdown due to
GPUVerify keeping track of what data is being written to/read from arrays.

PUG remains stable with the number of barrier synchronizations but is
affected negatively by the number of conditionals and loops. PUG is the fastest
tool with smaller inputs, but it raises false alarms in the access category, hence
these measurements are omitted from the corresponding plots.

We discuss GKLEE and SESA together since SESA processes GKLEE's NVCC
byte code output by concretizing variables, before passing it to GKLEE itself.
There are two main factors that negatively affect these symbolic execution tools:
(i) the number of loops, since they unroll each loop; and (ii) the amount of
bookkeeping required to keep track of what is read from/written to memory.
Figure 8 shows clear exponential curves for the access and barrier synchronization
categories. Observe that these tools time out immediately in the loop categories.


Claim 3: Real-World Usability 


We evaluate the usability of our approach by comparing Faial with other static
verification tools (GPUVerify and PUG) on real-world kernels wrt. rate of false
alarms and run time. We curated a set of CUDA kernels from [2], which consists
of 4 benchmark suites (totaling 227 CUDA kernels): NVIDIA GPU Computing
SDK v2.0 (8 CUDA kernels); NVIDIA GPU Computing SDK v5.0 (166 CUDA
kernels); Microsoft C++ AMP Sample Projects (20 kernels); and gpgpu-sim
benchmarks (33 kernels). All kernels are DRF and have been pre-processed by the
authors of [2] to facilitate verification. Each kernel is in a distinct file, all
dependencies are available, and kernels are annotated with minimal pre-conditions to
allow for automatic analysis (e.g., thread count is given).

Fig. 8. Results for Claim 2. Run times (left plots) are given on a logarithmic scale,
and memory usage (right plots) is given on a linear scale. Flatter and lower curves
are better. Tools annotated with a triangle are excluded due to timeouts or errors.

Fig. 9. Results for Claim 3, on a set of 227 DRF CUDA kernels.

As we aim to evaluate fully automatic verification by the three tools, we removed
code annotations (pre-conditions and loop invariants) specific to GPUVerify.
Additionally, we made minor changes to some kernels to meet the limitations of
the front-end of Faial and PUG. For instance, we converted nested array lookups
to use temporary variables and inlined function calls that operate on arrays in
22 kernels. Another 8 kernels were modified to simplify their control flows. Our
curated dataset will be included in our artifact submission.

Figures 9a, b, and c give the correctness results of Faial, GPUVerify, and PUG,
respectively. Correct refers to the true-positive rate, i.e., when the tool correctly
identifies the kernel as DRF. False Alarm refers to the false alarm rate, i.e., when
the tool incorrectly identifies the kernel as racy. A kernel is Unsupported if it
makes the tool crash. A Timeout occurs when the tool exceeds the limit of 60s
to verify a kernel. The values shown are an average calculated over five runs.
Figure 9d shows the average run time and memory usage of every true-positive
report (we omit invalid reports) across the three tools.

Overall Faial has the highest rate of true-positives at 96%. Our tool is second 
in terms of run time and memory usage, showing a good compromise w.r.t. time 
and space. Faial verifies most kernels within 1s, and all kernels that need more 
time are only verified by Faial. GPUVerify shows lower memory usage at the 
cost of a higher verification run time. PUG verifies the lowest number of kernels 
(34.8%), as most kernels are unsupported (62.6%). 


7 Related Work 


SMT-Based DRF Analyses. Li and Gopalakrishnan propose a direct encoding of 
DRF analysis of GPU programs in SMT, with PUG [24,25]. Both PUG and Faial 
follow a similar approach of barrier splitting: having a symbolic representation of 
a canonical interleaving, and dividing up the analysis over barrier intervals. The 
two major distinctions are that (1) PUG misses inter-thread data-races in syn- 
chronized loops, e.g., Listing 2.1, and (2) the algorithms of PUG are unspecified 
and lack soundness proofs. In [24, Sect. 6.3] the authors identify the challenge of 
detecting inter-thread data-races, but do not elaborate a solution. Ma et al. [30] 
present a similar technique to detect data-races and deadlocks in OpenMP pro- 
grams (CPU-based parallelism). However, their work does not guarantee DRF, 
and they do not formalize their algorithms. In [8], Prasanth et al. propose a 
polyhedral encoding of DRF for OpenMP programs, which is only applicable to 
programs with affine array accesses. However, the prevalence of linearized array
expressions in GPU kernels is known to stump polyhedral analysis [16]. 


Hoare-Logic-Based DRF Analyses. The main drawback of Hoare-logic based 
tools is their high rate of false alarms. They also require code annotations from 
a concurrency expert to handle loops. GPUVerify [2, 3, 5, 6, 12] can verify CUDA
and OpenCL kernels using Boogie [4] as a backend. GPUVerify also relies on 
a two-thread abstraction (pen and paper proof)—in this paper, we present the 
first machine-checked proof of the two-thread abstraction idea. VeriCUDA [20, 21] 
focuses on reasoning about the functional correctness of GPU programs using 
Hoare-logic. In [22] the authors extend VeriCUDA to proving DRF. In a sim- 
ilar vein, VerCors [7] uses separation logic to prove the functional correctness 
and DRF of GPU kernels. Both VeriCUDA and VerCors expect a tool-specific 
language, hence cannot handle real-world kernels directly. 


Data-Race Finders. These include dynamic data-race detection, symbolic- 
execution, and model-checking. Such techniques are better suited for highly 
detailed analysis in smaller kernels, and typically are unable to prove DRF. 
Dynamic data-race detection executes a kernel to find data-races on a fixed 
input, e.g., [14, 18, 19, 28,32,38,39]. This technique only reports real data-races, 
but suffers from a slowdown of at least 10x compared to the non-instrumented 
program, and requires the kernel input data, which might be unavailable or 
unknown. Symbolic execution and model checking have been extended to detect 
data-races [10,11,26,33,37]. These techniques do without the kernel input data
and can detect more data-races than dynamic data-race detection. 


Miscellaneous. Ferrel et al. introduce a machine-checked formalism to reason 
about the semantics of CUDA assembly [15]. Dabrowski et al. mechanize the 
DRF-analysis of multithreaded programs [13]. Muller and Hoffmann present a 
logic to reason about the evaluation cost of CUDA kernels [31]. 

Other behavioral types have been used to verify parallel and multithreaded 
systems that communicate via message-passing [29,35,36]. However, these do not
capture shared memory (only message-passing), thus cannot address data-races. 


8 Conclusion 


We tackle the problem of statically checking DRF in GPU kernels, with a new 
family of behavioral types, i.e., access memory protocols. We provide a novel 
compositional analysis of access memory protocols, along with fully mechanized 
proofs and an implementation. Our evaluation explores challenging and diverse 
benchmarks (229 real-world and 258 synthetic kernels) to demonstrate that our 
approach is more precise (false alarms and missed alarms), scalable (time/mem- 
ory growth), and usable (real-world kernels correctly verified) than other tools. 


Acknowledgements. We thank Rumyana Neykova, Stephen Chang, and the anony- 
mous reviewers for their insightful feedback on earlier versions of this work. 
Abstract. GENMC is an LLVM-based state-of-the-art stateless model
checker for concurrent C/C++ programs. Its modular infrastructure 
allows it to support complex memory models, such as RC11 and IMM, 
and makes it easy to extend to support further axiomatic memory 
models. 

In this paper, we discuss the overall architecture of the tool and how 
it can be extended to support additional memory models, programming 
languages, and/or synchronization primitives. To demonstrate the point, 
we have extended the tool with support for the Linux kernel memory 
model (LKMM), synchronization barriers, POSIX I/O system calls, and 
better error detection capabilities. 


1 Introduction 


For any software developer or verification engineer, it is no news that concurrent 
programming is difficult, that concurrent software is often buggy, and that there- 
fore verification of concurrent programs has attracted a lot of research interest. 
Within the verification community at least, it is also common knowledge that 
verification of concurrent programs is challenging because of the huge number 
of interleavings of the threads comprising a concurrent program. 

What has changed in the last decade, however, is the importance of weak 
memory consistency [6,11,13,14,21,25,32,36,40,41] as a key factor contribut- 
ing to the complexity of concurrent programming. Weak memory models do not 
simply increase the number of thread interleavings; they also confound program- 
mers, who typically have little intuition about how to reason about the behaviors 
induced by these additional interleavings. 

GENMC is a fully automatic verification tool meant for such programmers. 
It is a stateless model checker (SMC) [23] that can be used to verify bounded 
clients of intricate concurrent algorithms, such as implementations of synchro- 
nization primitives and shared data structures (e.g., queues, sets, and maps). It 
accepts as input a C/C++ program using C/C++11 atomics and/or the concur- 
rency primitives from the pthread library, and reports any data races, assertion 
violations, or other errors encountered. By default, verification is performed with 
respect to the RC11 memory model [32], but there are command line options for 
selecting other models, such as IMM [41] and LKMM [10]. 
Since the theory underlying GENMC has already been published elsewhere
[28,29,31], this paper focuses on the overall design of the tool and on various 
enhancements implemented in it. Our main design goals for GENMC were:


Generality: The tool should be able to verify programs written in a variety of 
programming languages with respect to a variety of memory models. 
Efficiency: The tool should implement a state-of-the-art SMC algorithm and 
incorporate further optimizations for common programming patterns. 
Usability: The tool should provide useful and readable error messages. 
Extensibility: The tool should be easily adaptable to support additional models 
and synchronization primitives, and to tweak its performance. Extensibility 
is key to achieving the other goals, since it allows gradual improvements to 
the tool in terms of coverage, performance, and error detection/reporting. 


These goals are achieved by a combination of techniques: 


GENMC’s core SMC algorithm [29,31] is parametric in the choice of the 
memory model—subject to a few minimal constraints (see Sect. 2). 

The implementation is based on LLVM, a versatile intermediate language for 
multiple programming languages. 

GENMC follows a modular architecture minimizing dependencies across com- 
ponents (see Sect. 3), which makes it easy to extend with support for addi- 
tional memory models (Sect. 4) and synchronization primitives (Sect. 5). 

Its architecture contains hooks to provide fast approximate consistency 
checks, which are exploited by the memory model implementations (see 
Sect. 4). 

GENMC contains a number of optimizations that provide noticeable perfor- 
mance benefits on common workloads (Sect. 7). 

GENMC keeps additional metadata so as to present error messages in terms 
of variable names appearing in the source code (Sect. 6).


GENMC has been applied to a few industrial settings, where it has found bugs 
and/or verified bounded correctness of concurrent libraries [39]. 


Related Work. There has been extensive work on SMC, with most tools focusing
on sequential consistency [7,8,15,23,37]. Tools that support weak memory
models include CDSCHECKER [38] that verifies C/C++11 programs under the
original C11 memory model, TRACER [5] that verifies C/C++11 programs under
the RA model, RCMC [27] that verifies C programs under RC11 [32], and
NIDHUGG [1,2,4,12,13] that supports SC, TSO, PSO and provides limited support
for the POWER and ARMv7 memory models. In contrast to GENMC, which 
uses the same core algorithm for all memory models, NIDHUGG uses multiple 
different algorithms depending on the memory model. 

There has also been work on adapting SAT/SMT-based bounded model 
checking (BMC) techniques for weak memory models [9,17,22]. DARTAGNAN 
[22] is a BMC tool that is parametric in the choice of the memory model, as it 
accepts the memory model as input in the litmus format [11]. 
2 Memory Model Requirements


GENMC’s core algorithm is parametric in the choice of the memory model pro- 
vided that it can be expressed in an axiomatic way and satisfies a few basic 
requirements that we describe below. 

Axiomatic memory models represent the executions of a concurrent program 
as execution graphs [11] that satisfy a certain consistency predicate. Execution 
graphs comprise a set of events (nodes) that represent the individual memory 
accesses performed by the program, and some relations on these events (edges). 
Example relations included in all memory models are the preserved program 
order (ppo) and reads-from (rf) relations: ppo relates events in the same thread 
that are ordered (e.g., by a chain of dependencies or a fence), while rf relates 
writes to reads reading from them. 

GENMC can be used to verify programs under such a model as long as the 
model’s consistency predicate fulfills the following requirements: 


No-Thin-Air: In consistent graphs, ppo ∪ rf should be acyclic. This intuitively
means that an event cannot circularly depend on itself. 

Prefix-Closedness: Restricting a consistent graph to any (ppo ∪ rf)-prefix-closed
subset of its events yields a consistent graph. Prefix-closedness enables
the algorithm to construct a consistent graph incrementally. 

Extensibility: Adding a (ppo ∪ rf)-maximal event to a consistent graph for
some choice of an incoming rf-edge preserves consistency. This captures the 
intuitive idea that executing a program should never get stuck if a thread has 
more statements to execute. In particular, a read of x should always be able 
to return the value written by the most recent write to x.
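
As an illustration of the No-Thin-Air requirement, the following Python sketch checks acyclicity of ppo ∪ rf on a hand-built load-buffering example with data dependencies. The event names, the edge-set encoding, and the helpers is_acyclic and no_thin_air are assumptions made for this illustration; they are not part of GENMC.

```python
from collections import defaultdict


def is_acyclic(edges):
    """Return True if the directed graph given as (src, dst) pairs has no cycle."""
    successors = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        nodes.update((src, dst))

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour = {node: WHITE for node in nodes}

    def visit(node):
        colour[node] = GREY
        for nxt in successors[node]:
            if colour[nxt] == GREY:  # back edge: a cycle exists
                return False
            if colour[nxt] == WHITE and not visit(nxt):
                return False
        colour[node] = BLACK
        return True

    for node in nodes:
        if colour[node] == WHITE and not visit(node):
            return False
    return True


def no_thin_air(ppo, rf):
    """No-Thin-Air: the union of ppo and rf must be acyclic."""
    return is_acyclic(ppo | rf)


# Thread 1: r1 = x; y = r1        Thread 2: r2 = y; x = r2
# Each store depends on the preceding load, so both pairs are in ppo.
ppo = {("R1x", "W1y"), ("R2y", "W2x")}

# Each load reads from the other thread's store: a circular dependency.
rf_thin_air = {("W1y", "R2y"), ("W2x", "R1x")}
# Each load reads from the initialisation write instead.
rf_init = {("Wx0", "R1x"), ("Wy0", "R2y")}

print(no_thin_air(ppo, rf_thin_air))  # False
print(no_thin_air(ppo, rf_init))      # True
```

The first rf choice lets each value justify itself through the other thread, which is exactly the kind of graph the requirement excludes.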


These requirements are satisfied by almost all axiomatic memory models 
(e.g., TSO [40], PSO [42], Power [11], ARMv7 [11], ARMv8 [21], RC11 [32], 
IMM [41], LKMM [10]). The only known axiomatic memory model that does 
not satisfy these requirements is the original formulation of the C/C++11 model 
[13], which has been criticized for its flaws [32,43]. 

Although these requirements cannot be satisfied by more advanced memory 
models that cannot be defined in an axiomatic fashion (e.g., [14,24, 25, 33]), there 
is ongoing work to support such a model. 


3 Tool Architecture 


Verification with GENMC comprises three stages (cf. Fig. 1, left). 

The first stage invokes clang to compile the source C/C++ program to 
LLVM-IR. To accommodate programs written in different languages, GENMC 
also accepts LLVM-IR as its input, provided that it adheres to certain conven- 
tions about thread creation. 

The second stage transforms the LLVM-IR code to make verification more
effective by replacing spinloops by assume statements, bounding infinite loops,
and performing sound optimizations, such as dead allocation elimination. It also
collects additional debugging information to enable better error reporting.

Fig. 1. Overall architecture (left) and dynamic components (right).

The third stage invokes the verification procedure, which explores all the 
executions of the program. If an error is found during this stage, the execution 
is halted and an error report is produced (see Sect. 6). 

The architectural subcomponents of this stage are depicted in Fig. 1 (right). 
At the center lies the verification driver, which owns three independent compo- 
nents: an execution graph, a work set, and an interpreter. 

The execution graph records the visited execution trace, and has routines 
for calculating various relations on the graph, such as the happens-before relation.
As each memory model comprises different relations, the execution graph
contains multiple calculators that are dynamically populated when the graph 
is created, and the consistency predicate is calculated as a fixpoint of all the 
selected relations, whenever this is requested by the driver. 

The work set records alternate options for later exploration, the precise def- 
inition of which can depend on the memory model. 

The interpreter merely executes the user program, notifying the driver each 
time a “visible” action (e.g., a load/store to shared memory) is encountered. It is 
directly based on the LLVM interpreter lli [35], and is the only part of our code
base that heavily depends on LLVM. In turn, the driver modifies the execution
graph accordingly, possibly pushes some items to the work set, and returns control
back to the interpreter, along with a value that will be used by the interpreter, 
if necessary (e.g., in the case of a load). In effect, the driver and the interpreter 
can be thought of as coroutines [18]. The interpreter calls the driver whenever it 
encounters a visible action or finishes running a thread, while the driver monitors 
execution consistency, schedules the program threads, and discovers alternative 
exploration options, which are pushed to the work set. 
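
The coroutine-style interaction between the interpreter and the driver can be sketched with a Python generator. This is a toy under strong assumptions: a single straight-line thread, a driver that always answers a load with the latest stored value, and no work set or consistency checking. The tuple encoding of instructions and the names interpreter and driver are hypothetical.

```python
def interpreter(program):
    """Execute a straight-line thread, yielding each visible action to the driver."""
    registers = {}
    for op in program:
        if op[0] == "store":            # ("store", address, value)
            yield op                    # notify the driver; no reply needed
        elif op[0] == "load":           # ("load", address, register)
            _, address, register = op
            # Hand control to the driver and resume with the value it chose.
            registers[register] = yield ("load", address)
    return registers


def driver(program, initial_memory):
    """Record visible actions in a trace and supply the values that loads read."""
    graph = []                          # stand-in for the execution graph
    memory = dict(initial_memory)
    thread = interpreter(program)
    try:
        action = next(thread)           # run until the first visible action
        while True:
            graph.append(action)
            if action[0] == "store":
                _, address, value = action
                memory[address] = value
                reply = None
            else:
                reply = memory[action[1]]
            action = thread.send(reply)  # return control to the interpreter
    except StopIteration as finished:
        return graph, finished.value


program = [("store", "x", 1), ("load", "x", "r1"), ("store", "y", 2)]
graph, registers = driver(program, {"x": 0, "y": 0})
print(graph)      # [('store', 'x', 1), ('load', 'x'), ('store', 'y', 2)]
print(registers)  # {'r1': 1}
```

In a model checker the reply to a load is a choice point; the alternatives not taken are what would be pushed to the work set.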

The aforementioned components are all parameterized by the user’s configu- 
ration options. The most important of these options is the memory model, which 
also determines whether dependencies between instructions should be tracked by 
the interpreter and stored in the execution graph. Another important option is 
when and how consistency is to be calculated. Since checking consistency at 
each step can be expensive for some memory models, it is possible to provide an 
approximate consistency check to be applied at each step and only perform the
full consistency check once an error is detected. 

To facilitate memory-model-specific optimizations, the driver is overridden 
for each memory model. Each instance sets up the (approximate) consistency 
checks and can provide specialized methods for crucial verification components. 


4 Supporting New Memory Models 


Adding support for a new memory model entails three basic steps. 

First, one has to provide definitions for any memory model primitives that the 
interpreter should intercept beyond those already supported (i.e., plain memory 
accesses and C/C++11 atomics). One can either provide a header file mapping 
these primitives to LLVM-IR instructions or create special event types for them. 

Second, one has to provide calculators for the memory model’s relations that 
are not already supported by GENMC. Depending on the memory model, this 
step may require a variable amount of effort, but it effectively boils down to 
translating relational calculations into matrix operations. 
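
To make the phrase "translating relational calculations into matrix operations" concrete, the sketch below encodes relations over n events as n-by-n Boolean matrices: union is element-wise disjunction, sequential composition is a Boolean matrix product, and transitive closure is a fixpoint. The helper names and the message-passing example are assumptions for illustration, and a happens-before of the form (po ∪ sw)+ is used only as a familiar example, not as the definition used by any particular GENMC calculator.

```python
def relation(n, pairs):
    """Build an n-by-n Boolean matrix from a collection of (i, j) pairs."""
    matrix = [[False] * n for _ in range(n)]
    for i, j in pairs:
        matrix[i][j] = True
    return matrix


def union(a, b):
    n = len(a)
    return [[a[i][j] or b[i][j] for j in range(n)] for i in range(n)]


def compose(a, b):
    """Sequential composition a;b as a Boolean matrix product."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Least fixpoint of closure = closure ∪ (closure ; r)."""
    closure = r
    while True:
        step = union(closure, compose(closure, r))
        if step == closure:
            return closure
        closure = step


def is_irreflexive(r):
    return not any(r[i][i] for i in range(len(r)))


# Message passing: 0 = W x, 1 = W_rel y (thread 1); 2 = R_acq y, 3 = R x (thread 2)
po = relation(4, [(0, 1), (2, 3)])   # program order
sw = relation(4, [(1, 2)])           # the acquire read synchronizes with the release write
hb = transitive_closure(union(po, sw))

print(hb[0][3])            # True: the write to x happens before the read of x
print(is_irreflexive(hb))  # True
```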

Third, one can also provide approximations for the consistency checks. Such 
approximations entail storing crucial information about a memory model’s relations
as vector clocks (e.g., causally preceding events, for some notion of causality),
but what to store is up to the user to decide and encode. Importantly,
GENMC’s performance depends not only on the calculators provided in
the previous step, but also on the effectiveness of the approximations, which 
quickly filter out inconsistent exploration options. For instance, GENMC’s cur- 
rent RC11 driver treats SC accesses as release-acquire (RA) accesses (the con- 
sistency of which can be quickly determined), and only checks for full RC11 
consistency when an error has been triggered, a heuristic that seems to work 
well in practice for programs that have both SC and non-SC accesses. 
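
A minimal sketch of how vector clocks can answer "does a causally precede b?" in constant time is given below. It assumes that each event is identified by a thread number and a 1-based index within that thread, and that the causal predecessors of a new event (its program-order predecessor and, for a synchronizing read, the write it reads from) are supplied by the caller. The names Event, add_event, and causally_precedes are hypothetical.

```python
def join(a, b):
    """Pointwise maximum of two vector clocks."""
    return [max(x, y) for x, y in zip(a, b)]


class Event:
    def __init__(self, thread, index, clock):
        self.thread = thread   # thread identifier
        self.index = index     # 1-based position within the thread
        self.clock = clock     # clock[t] = latest event of thread t that precedes this one


def add_event(num_threads, thread, index, predecessors):
    """Create an event whose clock joins the clocks of its causal predecessors."""
    clock = [0] * num_threads
    for pred in predecessors:
        clock = join(clock, pred.clock)
    clock[thread] = index
    return Event(thread, index, clock)


def causally_precedes(a, b):
    """O(1) check: a precedes b iff b's clock already covers a."""
    return a is not b and b.clock[a.thread] >= a.index


# Thread 0 writes x, then y. Thread 1 reads y, then x.
w_x = add_event(2, 0, 1, [])
w_y = add_event(2, 0, 2, [w_x])

# Case 1: the read of y synchronizes with the write of y.
r_y = add_event(2, 1, 1, [w_y])
r_x = add_event(2, 1, 2, [r_y])
print(causally_precedes(w_x, r_x))   # True

# Case 2: the read of y takes the initial value, so no synchronization occurs.
r_y0 = add_event(2, 1, 1, [])
r_x0 = add_event(2, 1, 2, [r_y0])
print(causally_precedes(w_x, r_x0))  # False
```

Because the clock comparison is cheap, such a check can run at every step, leaving the more expensive full consistency check for the rare cases where it is needed.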

All in all, adding support for a memory model largely depends on the com- 
plexity of the model. Adding support for models like SC or RA is trivial, since 
such accesses are already supported as part of RC11 and IMM. In contrast, 
adding support for LKMM involved much more work, as we describe below. 


4.1 Supporting the Linux Kernel Memory Model (LKMM) 


LKMM [10] is a memory model that encompasses a variety of different architec- 
tures supported by the Linux kernel. As LKMM differs substantially from RC11 
and IMM, supporting it required all steps described above as well as a few other 
engineering decisions, the most important of which are discussed below. 

First, LKMM uses complex constraints for checking consistency of an exe- 
cution graph. As repeatedly calculating these constraints can be expensive, we 
designed approximations for them. Unlike most other memory models, LKMM 
does not define a suitable happens-before relation for checking coherence and 
detecting races. (Its hb relation cannot be used for this purpose.) We thus defined 
a custom happens-before relation that can rule out inconsistent executions very
quickly, and use it to approximate coherence and race detection checks. 

Second, although LKMM dictates that non-atomic accesses (called plain 
in LKMM’s jargon) only conditionally contribute to ppo, we incorporate such 
accesses in GENMC’s ppo (thus arriving at a stronger notion of ppo), mostly 
for technical reasons. Specifically, the calculation of dependencies between only
non-plain accesses is difficult because each non-plain access at the source-code
level may map to several plain and non-plain accesses at the LLVM-IR level.

To increase confidence in our implementation, we ran all litmus tests dis- 
tributed along with LKMM as part of the Linux kernel (32 tests in total), and 
compared our results with the results of the HERD [11] memory model simulator. 
Both tools explored the same number of executions for all tests. 

In addition, we extracted some manually written tests from LKMM’s supple- 
mentary repository [34] (categories atomic and kernel). We picked these cat- 
egories as they contain tests written in C pseudocode (thus easily translatable 
to C) and do not contain tests with plain accesses, which, as described, GENMC 
treats slightly differently from what LKMM dictates. In total, these categories 
amount to another 84 tests, from which we excluded two tests containing unsup- 
ported primitives, one test for which HERD did not terminate within 42h, and 
three tests that cannot be cleanly translated to C. Out of the remaining 78 tests, 
GENMC explores the same number of executions for 75 tests. The discrepancies 
observed in the three remaining tests are due to the different way the two tools 
produce and calculate dependencies. (In GENMC, control dependencies extend 
to all subsequent memory accesses of the same thread, whereas in HERD they 
extend only to the merge point of a conditional statement.) 

We note that HERD took about 18min to run all the above tests, while 
GENMC needed less than 2s. 


5 Supporting New Languages and Libraries 


Supporting additional programming languages is straightforward as long as they 
can be compiled to LLVM. This was, for example, the case when we extended 
GENMC to accept C++ (the initial version accepted only C input). All we had 
to do was to create stub header files for the C++ library, and to extend the 
interpreter to recognize the memory (de)allocation calls generated by clang. 
Supporting different runtime environments (e.g., JVM bytecode) requires 
constructing a new interpreter for the desired runtime system that calls the 
driver whenever a visible action is encountered. In addition, since the driver 
and the interpreter communicate using the LLVM type information, it may be 
necessary to add a translation layer between the interpreter(s) and the driver. 
Supporting new concurrency libraries requires localized changes. If the 
library’s semantics can be implemented in terms of memory accesses, one has 
to construct an appropriate header file or extend the interpreter to provide the 
mapping from library calls to the relevant memory access events. If this is not 
possible and/or if native support for a library is desirable (e.g., for performance 
reasons), then the execution graph has to be extended with new kinds of events
and the consistency checks have to be adapted accordingly. 

Next, we present two such library extensions, one mapping its calls to indi- 
vidual memory accesses, and the other creating new kinds of events. 


System Calls. As part of [26], we extended GENMC with support for system 
calls, such as open(), close(), read() and write(), which can be modeled by 
making multiple primitive calls (reads and writes) to a different address space. 

There are two ways one could implement these system calls: either by pro- 
viding an actual implementation (which would then be compiled to LLVM-IR) 
or by adding support in the interpreter to internally implement those calls and 
communicating multiple times with the driver. 

We preferred the latter solution because it is more portable. An external 
implementation would have to be manually ported whenever support for more 
languages is added. In contrast, the internal implementation needs no change. 
Further, even if a new interpreter for a different runtime system is added, it 
should be simple to decouple the system calls from the interpreter, and have the 
different runtime systems share the infrastructure that handles system calls. 


Barriers. N-way barriers are a widely-used synchronization primitive. They have 
two functions: barrier_init and barrier_wait. The former initializes a barrier 
object with the number of threads that will rendezvous at the barrier, while the 
latter is called every time a thread reaches the barrier. A thread that is calling 
barrier_wait blocks until the initially specified number of threads reaches the 
barrier, at which point all threads will be simultaneously unblocked, and the 
barrier value will reset to the one specified with barrier_init. 

Barriers can be straightforwardly implemented with a shared variable count- 
ing the number of threads that have called barrier_wait. But doing so yields 
poor model checking performance. For N threads calling barrier_wait, there 
are N! possible orders in which they can update the shared counter, thus crip- 
pling the performance of the tool. Tracking the order of these updates is not 
only expensive but also completely unnecessary. For many real-world use cases 
of barriers (e.g., scatter-gather workloads), the order in which different threads
reach the barrier is irrelevant, and which thread arrives last is unimportant.

We leverage this intuition and provide built-in support for barrier_init 
and barrier_wait calls that does not track the relative ordering among 
barrier_wait calls synchronizing with one another, thereby achieving an expo- 
nential reduction in verification time. Concretely, in the simple program below 
where N threads execute barrier_wait concurrently, GENMC explores only one 
execution instead of N! executions: 


barrier_wait(); || ... || barrier_wait(); 


Our extension is called BAM (Barrier-Aware Model-checking) and is detailed 
and evaluated in a companion paper [30]. 
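
The factorial blow-up of the counter-based encoding can be reproduced with a small enumeration. The sketch below lists every order in which N threads could increment a shared barrier counter and confirms that all orders end in the same counter value, which is why a single representative suffices. The function names are hypothetical, and the model deliberately ignores everything except the counter updates.

```python
from itertools import permutations
from math import factorial


def counter_update_orders(num_threads):
    """All orders in which the threads can increment the shared barrier counter."""
    return list(permutations(range(num_threads)))


def final_counter(order):
    """Replay one order of increments and return the resulting counter value."""
    counter = 0
    for _thread in order:
        counter += 1
    return counter


for n in range(1, 7):
    orders = counter_update_orders(n)
    outcomes = {final_counter(order) for order in orders}
    assert len(orders) == factorial(n)
    assert outcomes == {n}   # every order releases the barrier in the same state
    print(f"N={n}: {len(orders)} update orders, {len(outcomes)} distinct outcome")
```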
6 Error Detection and Reporting


GENMC detects a number of different kinds of errors: violations of user-supplied 
regular and persistency assertions, data races, memory errors and simple cases 
of termination errors. It reports errors by printing an offending execution graph 
and highlighting the event(s) that caused the violation. Upon request, GENMC 
can also print a total ordering of the instructions that lead to the violation, or 
produce the offending execution in the DOT graph description language. 


Persistency Errors. To verify persistency properties of programs performing file 
I/O, we allow user programs to contain a special recovery routine [26], which 
would typically check some invariant over the persisted state. 

When such a routine is present, GENMC simulates all possible ways in which 
the program could have crashed because of a power failure, executing the recov- 
ery routine at the end of every such execution. Of course, to avoid the obvious 
state-space explosion, the simulation of all the possible failures is done in an 
optimized fashion, driven by the memory accesses of the recovery routine. 
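
The idea of running a recovery check after every possible crash can be sketched by brute force. The code below is not the optimized procedure described above: it assumes, for simplicity, that writes persist strictly in program order, so the possible post-crash states are exactly the prefixes of the write sequence. Real filesystems may persist writes in other orders. The names crash_states, failing_crash_states, and the data/valid example are hypothetical.

```python
def crash_states(initial, writes):
    """Yield the persisted state after each prefix of `writes`.

    Assumption: writes reach persistent storage in program order.
    """
    state = dict(initial)
    yield dict(state)                  # crash before any write persisted
    for key, value in writes:
        state[key] = value
        yield dict(state)              # crash right after this write persisted


def failing_crash_states(initial, writes, recovery_invariant):
    """Return every post-crash state in which the recovery check fails."""
    return [state for state in crash_states(initial, writes)
            if not recovery_invariant(state)]


def recovery_invariant(state):
    """If the record is marked valid, its payload must have been written."""
    return state["valid"] == 0 or state["data"] == 42


initial = {"data": 0, "valid": 0}
payload_first = [("data", 42), ("valid", 1)]
flag_first = [("valid", 1), ("data", 42)]

print(failing_crash_states(initial, payload_first, recovery_invariant))  # []
print(failing_crash_states(initial, flag_first, recovery_invariant))
# [{'data': 0, 'valid': 1}]
```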

The performance of GENMC when verifying persistency properties of programs
under the ext4 filesystem has been evaluated in [26].


Memory Errors. Memory errors refer to accessing uninitialized, unallocated or
deallocated memory. In models like RC11 [32], reasoning about memory safety 
can be tricky at times, as demonstrated by the example below: 


p := alloc();     ||   if x = 1 then
*p :=rlx 42;      ||     a :=rlx p;
x :=rlx 1;        ||     b :=rlx *a;


This example is erroneous under RC11 because the allocation of p is not guar- 
anteed to have propagated to the second thread by the time it is dereferenced. 
(Since all accesses are relaxed, there is no synchronization between the threads.) 

GENMC also accounts for more complicated scenarios such as p being con- 
currently freed when accessed, p being freed twice, or p being the address of a 
local (stack) variable that might not be alive when accessed. 


Refining Error Reports. It is often useful to refine the error reporting. For exam- 
ple, in memory models that treat data races as errors (such as RC11), GENMC 
by default detects data races and reports them as errors. This, however, can be 
costly in terms of verification time or even prohibit the verification of programs 
that use compiler/custom primitives to access shared memory, as such programs 
would almost certainly be considered racy. 

To deal with such cases, GENMC provides switches that disable race detec- 
tion and refine the range of errors that will be reported to the user. Switches 
of the latter kind are especially useful when dealing with programs that contain 
system calls. By default, when such system calls fail, GENMC reports an error, 
which is inconvenient for programs that contain proper error handling, as some 
system errors are rather benign (e.g., a file not existing). With the appropriate
switch, in case of system errors, an appropriate value is written in errno, as
dictated by the POSIX standard.


Case Study. We demonstrate the error reporting capabilities of GENMC with 
a real use case. We consider a flat-combining queue [19] that has been proposed 
to be ported in Rust’s crossbeam library. 

This queue serves as a nice case study for a couple of reasons. First, it con- 
tains loops that can diverge, and so its verification requires loop bounding, which 
GENMC can do automatically. Second, it is implemented using compiler primi- 
tives for concurrent accesses, and so its verification requires disabling race detec- 
tion. Third, while experimenting with it, we found it to be buggy. 

The error report produced by GENMC can be seen in Fig. 2. The error is 
quite intricate: it requires three threads to manifest, each of which executes 
a large number of instructions. The error is due to an ordering bug (relaxed 
accesses are used instead of release/acquire), which demonstrates the need for 
model checking tools that handle weak memory models. 


Error detected: Attempt to read from uninitialized memory! 
Event (3, 63) in graph: 
<-1, 0> main: 


<0, 1> thread_n: 


(1, 18): Urel (cmb.queue, 0) [(0, 36)] L.169: combiner.c 
(1, 19): Urel (cmb.queue, 2565579352) L.169: combiner.c 


(1, 96): Racq (m.msg._meta.next, 2565579416) [(2, 26)] L.228: combiner.c 


(1, 112): Wrlx (cmb.takeover, 2565579416) L.158: combiner.c 
<0, 2> thread_n: 


(2, 26): Wrel (m.msg._meta.next, 94798317999592) L.167: combiner.c 
<0, 3> thread_n: 


(3, 18): Urel (cmb.queue, 2565579352) [(1, 19)] L.169: combiner.c 
(3, 19): Urel (cmb.queue, 2565579480) L.169: combiner.c 


(3, 50): Rrlx (cmb.takeover, 2565579416) [(1, 112)] L.87: combiner.c 
(3, 63): Racq (m.msg._meta.next, 0) [BOTTOM] L.187: combiner.c 


Number of complete executions explored: 2795 
Number of blocked executions seen: 6001 
Total wall-clock time: 2.12s 


Fig. 2. An error report by GENMC after removing irrelevant lines. 


We note that the error report contains helpful debugging information, such 
as the names of variables accessed (e.g., m.msg._meta.next) and the values
read/written. To display this information, GENMC maintains a mapping from 
addresses to program variables using the additional debugging information col- 
lected in the “Transformation” phase. 
7 Other Performance Enhancements to GENMC


In this section, we briefly discuss two recent changes to the driver to optimize 
its performance for certain kinds of programs. 


Symmetry Reduction. Many programs, such as the flat-combining queue of 
Sect. 6, have a symmetric structure: each thread runs the same code. In such
cases, many execution graphs are equivalent up to some thread relabeling—a 
property that is exploited by symmetry reduction (SR) [16,20]. 

We implemented a simple SR algorithm that detects whether multiple threads 
with the same code are spawned with no intervening memory accesses, and avoids 
exploring executions for which a symmetric one (by relabeling such threads) has 
already been explored. This can yield exponential improvements. For example, 
a program with N threads incrementing a shared variable atomically has N! 
executions; employing SR yields only one execution. With SR, the verification 
time of the corrected flat-combining queue drops from 15s to 2.5s. 
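
The N! to 1 reduction for N symmetric threads incrementing a shared variable can be illustrated with a canonicalization step. The sketch below represents an execution simply as the order in which the threads perform their increment, and treats two executions as symmetric if one is obtained from the other by renaming threads. It enumerates all executions and then deduplicates, whereas an SR algorithm would avoid exploring the duplicates in the first place. The names canonical and explore are hypothetical.

```python
from itertools import permutations


def canonical(execution):
    """Rename threads in order of first appearance so symmetric executions coincide."""
    renaming = {}
    for thread in execution:
        renaming.setdefault(thread, len(renaming))
    return tuple(renaming[thread] for thread in execution)


def explore(num_threads, symmetry_reduction):
    """Count the executions kept, optionally up to thread relabeling."""
    seen = set()
    for execution in permutations(range(num_threads)):
        key = canonical(execution) if symmetry_reduction else execution
        seen.add(key)
    return len(seen)


print(explore(4, symmetry_reduction=False))  # 24
print(explore(4, symmetry_reduction=True))   # 1
```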

To further demonstrate the benefits of SR, we measured the performance of 
GENMC with and without SR on some realistic lock implementations adapted 
from the literature. The results can be seen in Table 1. All reported times are in 
seconds, unless mentioned otherwise. We ran both GENMC versions three times 
for each benchmark, with an increasing number of threads each time (the initial 
thread number for each benchmark is provided in the second column). As can
be seen, SR leads to a significant performance improvement in all cases.


Table 1. Testing lock implementations (1h timeout; 4GB memory limit)


                     Without SR                  With SR
Benchmark    N  |  N      N+1     N+2     |  N      N+1    N+2
mutex        2  |  0.02   0.40    41min   |  0.03   0.08   164.66
mutex-musl   2  |  0.01   34.47   OOM     |  0.01   5.92   OOM
rwlock       2  |  0.02   0.18    47.34   |  0.04   0.05   1.94
spinlock     3  |  0.03   0.08    1.19    |  0.02   0.03   0.18
ticketlock   4  |  0.02   0.13    2.35    |  0.01   0.01   0.01
ttaslock     3  |  0.06   2.05    38min   |  0.08   0.11   33.87
twalock      3  |  0.03   0.49    79.68   |  0.03   0.04   0.36


Lock-Aware Partial Order Reduction. A common problem with locking is that 
of false sharing, where N threads contend to acquire the same lock even if it 
is unnecessary for correctness. In such cases, GENMC’s partial order reduction 
algorithm [29] will explore all N! orders in which the lock can be acquired even 
though they all lead to the same outcome. 
We have implemented lock-aware partial order reduction (LAPOR) [28], an
enhancement to partial order reduction that does not track ordering among 
locks unless their critical regions have conflicting accesses, in which case the 
lock ordering is induced from the ordering among those accesses. With LAPOR, 
GENMC achieves exponential improvements in lock-based implementations of 
concurrent libraries that have false sharing, such as search trees with coarse-grained
or hand-over-hand locking. LAPOR has been evaluated in [28].
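
The intuition behind ordering only conflicting critical sections can be sketched by counting equivalence classes of lock acquisition orders. The code below is a simplified model and not the LAPOR algorithm of [28]: each thread has one critical section described by its read and write sets, and two acquisition orders are considered equivalent when they order every conflicting pair of sections the same way. The names conflicts and lock_orders are hypothetical.

```python
from itertools import combinations, permutations


def conflicts(a, b):
    """Two sections conflict if they share a location that at least one of them writes."""
    (reads_a, writes_a), (reads_b, writes_b) = a, b
    return bool(writes_a & (reads_b | writes_b) or writes_b & reads_a)


def lock_orders(sections, lock_aware):
    """Count acquisition orders, optionally up to reordering of non-conflicting sections."""
    n = len(sections)
    conflicting = [(i, j) for i, j in combinations(range(n), 2)
                   if conflicts(sections[i], sections[j])]
    seen = set()
    for order in permutations(range(n)):
        if lock_aware:
            position = {thread: k for k, thread in enumerate(order)}
            # Only the relative order of conflicting sections is recorded.
            key = tuple(position[i] < position[j] for i, j in conflicting)
        else:
            key = order
        seen.add(key)
    return len(seen)


def section(reads, writes):
    return (frozenset(reads), frozenset(writes))


# Four threads take the same lock but update four different locations.
disjoint = [section([], [f"slot{i}"]) for i in range(4)]
print(lock_orders(disjoint, lock_aware=False))  # 24
print(lock_orders(disjoint, lock_aware=True))   # 1

# Two of the four threads now write the same location.
partly_shared = [section([], ["a"]), section([], ["a"]),
                 section([], ["b"]), section([], ["c"])]
print(lock_orders(partly_shared, lock_aware=True))  # 2
```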


8 Conclusion 


We presented GENMC, a state-of-the-art stateless model checker that can be 
used to verify consistency and persistency properties of C/C++ programs. We 
described its architecture, and how its modular design can be leveraged to 
account for new features and memory models. To widen the applicability of 
GENMC, we have extended it with support for LKMM, basic system calls and 
additional synchronization primitives. We have also improved its performance 
with optimizations, such as symmetry reduction and lock-aware partial order
reduction, that can exponentially decrease its search space.

In the future, we plan to implement a DSL for memory models, so as to make 
it easier to extend GENMC with new models and quickly tweak their approxi- 
mation strategies. We are also planning to incorporate further optimizations into 
the tool to enable more effective verification of lock-free algorithms. 
Abstract. A barrier certificate often serves as an inductive invariant
that isolates an unsafe region from the reachable set of states, and hence 
is widely used in proving safety of hybrid systems possibly over the 
infinite time horizon. We present a novel condition on barrier certifi- 
cates, termed the invariant barrier-certificate condition, that witnesses 
unbounded-time safety of differential dynamical systems. The proposed 
condition is by far the least conservative one on barrier certificates, and 
can be shown as the weakest possible one to attain inductive invariance. 
We show that discharging the invariant barrier-certificate condition— 
thereby synthesizing invariant barrier certificates—can be encoded as 
solving an optimization problem subject to bilinear matrix inequalities 
(BMIs). We further propose a synthesis algorithm based on difference- 
of-convex programming, which approaches a local optimum of the BMI 
problem via solving a series of convex optimization problems. This algo- 
rithm is incorporated in a branch-and-bound framework that searches for 
the global optimum in a divide-and-conquer fashion. We present a weak 
completeness result of our method, in the sense that a barrier certificate 
is guaranteed to be found (under some mild assumptions) whenever there 
exists an inductive invariant (in the form of a given template) that suf- 
fices to certify safety of the system. Experimental results on benchmark 
examples demonstrate the effectiveness and efficiency of our approach. 


1 Introduction 


Hybrid systems are mathematical models that capture the interaction between 
continuous physical dynamics and discrete switching behaviors, and hence are 
widely used in modelling cyber-physical systems (CPS). These CPS may be 
complex and safety-critical, with sensitive variables of the environment in their
sphere of control. Everyday examples include process control at all scales, ranging
from household appliances to nuclear power plants, or embedded systems
in the transportation domain, such as autonomous driving maneuvers in automotive,
aircraft collision-avoidance protocols in avionics, or automatic train control
applications, as well as a broad range of devices in health technologies, such as 
cardiac pacemakers. 

The safety-critical feature of these CPS, with increasingly complex behaviors, 
has initiated automatic safety or, dually, reachability verification of hybrid sys- 
tems [1, 15]. The problem of reachability verification is undecidable in general [1], 
albeit with decidable families of sub-classes (see, e.g., [2,16-18,31]) identified in 
the literature. The hard core of the verification problem lies in reasoning about 
the continuous dynamics, which are often characterized by ordinary differen- 
tial equations (ODEs). In particular, when nonlinearity arises in the ODEs, the 
explicit computation of the exact reachable set is usually intractable even for 
purely continuous dynamics [49]. 

Therefore in the literature, a plethora of approximation schemes, as surveyed 
in [15], for reachability analysis of hybrid systems has been developed, including 
an invariant-style reasoning scheme known as barrier certificate [41]. A barrier 
certificate often serves as an inductive invariant that isolates an unsafe region 
from the reachable set, thereby witnessing safety of hybrid systems possibly over 
the infinite time horizon. A common way to synthesize barrier certificates is to 
reduce the condition defining barrier certificates to a numerical optimization or 
constraint solving problem. There is, however, a trade-off between the expres- 
siveness of the barrier-certificate condition and the efficiency in discharging the 
reduced constraints. Hence, to enable efficient algorithmic synthesis of barrier 
certificates via, e.g., linear programming (LP), second-order cone programming 
(SOCP), semidefinite programming (SDP) and interval analysis [11,30], the gen- 
eral condition on inductive invariance (that a barrier certificate defines an invari- 
ant, see [8,51]) has been strengthened into a spectrum of different shapes, e.g., 
[8,29,51,60,62]. It has been, nevertheless, a long-standing challenge to find a 
barrier-certificate condition that is as weak as possible while admitting efficient 
synthesis algorithms. 

In this paper, we present a new condition on barrier certificates, termed 
the invariant barrier-certificate condition, based on the sufficient and necessary 
condition on being an inductive invariant [36]. Our invariant barrier-certificate 
condition is by far, to the best of our knowledge, the least conservative one 
on barrier certificates, and can be shown as the weakest possible one to attain 
inductive invariance. We show, by leveraging Putinar’s Positivstellensatz [32], 
that discharging the invariant barrier-certificate condition (and thereby
synthesizing invariant barrier certificates) can be encoded as solving an optimization
problem subject to bilinear matrix inequalities (BMIs). We further show that
general bilinear matrix-valued functions can be decomposed as a difference of two 
psd-convex (extension of convexity to matrix-valued functions) functions using 
eigendecomposition, thus resulting in a synthesis algorithm as per difference-of- 
convex programming (DCP) [33,52], which solves a series of convex sub-problems 
(in the form of linear matrix inequalities (LMIs)) that approaches (arbitrarily 
close to) a local optimum of the BMI problem. This algorithm is incorporated in
a branch-and-bound framework that searches for the global optimum in a divide- 
and-conquer fashion. We present a weak completeness result of our method, in 
the sense that a barrier certificate is guaranteed to be found (under some mild 
assumptions) whenever there exists an inductive invariant (in the form of a given 
template) that suffices to certify the system’s safety. A similar completeness
result has previously been provided only by symbolic approaches and, to the best
of our knowledge, not by methods based on numerical constraint solving, e.g., [4,60,61].
Experiments on a collection of examples suggest that our invariant
barrier-certificate condition recognizes more barrier certificates than existing conditions,
and that our DCP-based algorithm is more efficient than directly solving the
BMIs via off-the-shelf solvers.

Due to space restrictions, proofs and benchmark details have been omitted;
they can be found in an extended version of this paper [57].


2 A Bird’s-Eye Perspective 


We use the following example to give a bird’s-eye view of our approach. 


Example 1 (overview [11]). Consider the following continuous-time dynamical 
system modelled by an ordinary differential equation: 


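$$\begin{pmatrix}\dot{x}_1\\ \dot{x}_2\end{pmatrix} = \begin{pmatrix} x_1 + x_2\\ x_1x_2 - 0.5x_2^2 + 0.1\end{pmatrix}.$$


The verification obligation is to show that the system trajectory originating from
any state in the initial set $\mathcal{X}_0 = \{x \mid \mathcal{I}(x) \le 0\}$ with $\mathcal{I}(x) = x_1^2 + (x_2 - 2)^2 - 1$
will never enter the unsafe set $\mathcal{X}_u = \{x \mid \mathcal{U}(x) \le 0\}$ with $\mathcal{U}(x) = x_2 + 1$.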


A barrier certificate satisfying our condition in Definition 4 serves as an
inductive invariant that suffices to isolate the unsafe region $\mathcal{X}_u$ from the set of
reachable states from $\mathcal{X}_0$, thereby proving safety of the system over the infinite
time horizon. To this end, we proceed in the following steps.


1) Encode as Sum-of-Squares (SOS) Constraints. We set a (polynomial)
barrier-certificate template $B(a,x) = a x_2$ with unknown coefficient $a \in \mathbb{R}$.
According to Theorem 1, we only need to consider Lie derivatives up to order
$N_{B,f} = 1$, i.e., $\mathcal{L}_f^0 B(a,x) = a x_2$ and $\mathcal{L}_f^1 B(a,x) = a(x_1x_2 - 0.5x_2^2 + 0.1)$.

By Theorem 5, $B(a,x)$ is an invariant barrier certificate if there exist a
polynomial $v(x)$, SOS polynomials $\sigma(x)$, $\sigma'(x)$ and a constant $\epsilon > 0$ such that


$$\begin{aligned}
&-\underbrace{a x_2}_{B} + \sigma(x)\underbrace{\left(x_1^2 + (x_2 - 2)^2 - 1\right)}_{\mathcal{I}}, && \text{(1.1, initial)}\\
&-\underbrace{a\left(x_1x_2 - 0.5x_2^2 + 0.1\right)}_{\mathcal{L}_f^1 B} + v(x)\underbrace{a x_2}_{\mathcal{L}_f^0 B}, && \text{(1.2, consecution)}\\
&\underbrace{a x_2}_{B} + \sigma'(x)\underbrace{(x_2 + 1)}_{\mathcal{U}} - \epsilon && \text{(1.3, separation)}
\end{aligned}$$


are SOS polynomials. We set $\epsilon = 0.01$ in this example.
2) Reduce to a BMI Optimization Problem. Observe that the above SOS
constraints can be formulated as BMI constraints. For instance, let us assume
that (1.2) is an SOS polynomial of degree at most 2 and $v(s,x) = s_0 + s_1x_1 + s_2x_2$
is a template polynomial with unknown coefficients $s$. Then constraint (1.2) is
equivalent to the BMI constraint


$$\mathcal{F}_2(a,s) = -\begin{pmatrix} -0.1a & 0 & 0.5as_0\\ 0 & 0 & 0.5(as_1 - a)\\ 0.5as_0 & 0.5(as_1 - a) & as_2 + 0.5a \end{pmatrix} \preceq 0$$


meaning that the bilinear matrix (LHS of $\preceq$) is negative semidefinite. Note that
the bilinearity arises due to the coupling of the unknown coefficients $a$ and $s$.

Constraints (1.1) and (1.3) can be reduced to BMI constraints in an analogous
way¹, yielding $\mathcal{F}_1$ and $\mathcal{F}_3$. It then follows that, to solve the SOS constraints, we
need to find a feasible solution $(a,s)$ such that²


$$\mathcal{F}_1(a,s) \preceq 0 \;\wedge\; \mathcal{F}_2(a,s) \preceq 0 \;\wedge\; \mathcal{F}_3(a,s) \preceq 0. \tag{2}$$


To exploit well-developed optimization techniques, the feasibility problem (2)
is transformed to an optimization problem subject to BMI constraints:


$$\begin{aligned}
&\text{maximize} && \lambda\\
&\text{subject to} && \mathcal{B}_i(\lambda,a,s) \triangleq \mathcal{F}_i(a,s) + \lambda I \preceq 0, \quad i = 1,2,3
\end{aligned} \tag{3}$$


where $I$ is the identity matrix with compatible dimensions. Note that problem (2)
has a feasible solution if and only if the optimal value $\lambda^*$ in (3) is non-negative.


3) Decompose as Difference-of-Convex Problems. The problem (3) contains
non-convex constraints and hence does not admit efficient (polynomial-time)
algorithms tailored for convex optimizations. However, by our technique
presented in Sect. 5, a non-convex function $\mathcal{B}_i(\lambda,a,s)$ can be decomposed as the
difference of two psd-convex (defined later) matrix-valued functions:


$$\mathcal{B}_i(\lambda,a,s) = \mathcal{B}_i^+(\lambda,a,s) - \mathcal{B}_i^-(\lambda,a,s). \tag{4}$$
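The decomposition of $\mathcal{B}_2(\lambda,a,s)$, for instance, gives

$$\mathcal{B}_2^+(\lambda,a,s) = \frac{1}{8}\begin{pmatrix}
8\lambda + 0.8a + a^2 + 0.408s_0^2 & 0.408s_0s_1 & -2as_0 + 0.816s_0s_2\\
0.408s_0s_1 & 8\lambda + a^2 + 0.408s_1^2 & 4a - 2as_1 + 0.816s_1s_2\\
-2as_0 + 0.816s_0s_2 & 4a - 2as_1 + 0.816s_1s_2 & 8\lambda - 4a + 2.449a^2 - 4as_2 + s_0^2 + s_1^2 + 1.632s_2^2
\end{pmatrix},$$

$$\mathcal{B}_2^-(\lambda,a,s) = \frac{1}{8}\begin{pmatrix}
a^2 + 0.408s_0^2 & 0.408s_0s_1 & 2as_0 + 0.816s_0s_2\\
0.408s_0s_1 & a^2 + 0.408s_1^2 & 2as_1 + 0.816s_1s_2\\
2as_0 + 0.816s_0s_2 & 2as_1 + 0.816s_1s_2 & 2.449a^2 + 4as_2 + s_0^2 + s_1^2 + 1.632s_2^2
\end{pmatrix}.$$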


¹ Although no bilinearity is involved in constraints (1.1) and (1.3), they can be
processed in the same way as (1.2), yielding LMI constraints.

² Extra constraints on $\sigma(x)$ and $\sigma'(x)$ being SOS polynomials can be encoded
analogously in the feasibility problem, yet are omitted here for the sake of simplicity.
4) Solve a Series of Convex Sub-problems. Now, we apply a standard
iterative procedure in difference-of-convex programming [10] as follows. Given
a feasible solution $z^k = (\lambda^k, a^k, s^k)$ to the BMI optimization problem (3), the
concave part $-\mathcal{B}_i^-(\lambda,a,s)$ in (4) is linearized around $z^k$, thus yielding a series of
convex programs ($k = 0,1,\ldots$):


$$\begin{aligned}
&\underset{\lambda,a,s}{\text{maximize}} && \lambda\\
&\text{subject to} && \mathcal{B}_i^+(z) - \mathcal{B}_i^-(z^k) - D\mathcal{B}_i^-(z^k)(z - z^k) \preceq 0, \quad i = 1,2,3
\end{aligned} \tag{5}$$


where $D\mathcal{B}_i^-$ denotes the derivative of the matrix-valued function $\mathcal{B}_i^-$.

The soundness of our approach asserts that the feasible set of the linearized
program (5) under-approximates the feasible set of the original BMI program (3).
Therefore, if $\lambda^k \ge 0$ after iteration $k$, we can safely claim that $(a^k, s^k)$ is a feasible
solution to (2). A barrier certificate $B(x)$ is then obtained by substituting $a^k$ in
$B(a,x)$. Moreover, if we take the optimum of (5) at iteration $k$ to be the next linearization
point $z^{k+1}$, the solution sequence $\{z^k\}_{k \in \mathbb{N}}$ converges to a local optimum of (3).
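
The linearization step can be illustrated on a scalar toy problem. The problem below is our own illustrative choice and is not taken from the paper: maximize $-x$ subject to the non-convex constraint $1 - x^2 \le 0$, split as $b^+(x) = x^2$ and $b^-(x) = 2x^2 - 1$. Linearizing $b^-$ around the current iterate gives a convex (here: quadratic) constraint whose feasible set lies inside the original one, so every iterate stays feasible. All function names and the starting point are assumptions made for this sketch.

```python
import math

def b_plus(x):
    return x * x                # convex part of the constraint

def b_minus(x):
    return 2.0 * x * x - 1.0    # convex part that is subtracted

def d_b_minus(x):
    return 4.0 * x              # derivative of b_minus

def constraint(x):
    # original non-convex constraint: b_plus(x) - b_minus(x) = 1 - x^2 <= 0
    return b_plus(x) - b_minus(x)

def solve_linearized(xk):
    """Maximize -x subject to
        b_plus(x) - b_minus(xk) - d_b_minus(xk) * (x - xk) <= 0.
    The constraint reads x^2 - 4*xk*x + 2*xk^2 + 1 <= 0, i.e. an interval;
    its left end point maximizes -x."""
    return 2.0 * xk - math.sqrt(2.0 * xk * xk - 1.0)

def convex_concave(x0, iterations=20):
    # Re-linearize around the optimum of the previous sub-problem.
    xs = [x0]
    for _ in range(iterations):
        xs.append(solve_linearized(xs[-1]))
    return xs

iterates = convex_concave(3.0)  # 3.0 is strictly feasible: 1 - 9 < 0
```

In this toy instance the iterates decrease towards the boundary point $x = 1$, which is a local optimum of the original problem; the matrix-valued programs (5) follow the same pattern but require an SDP solver for each sub-problem.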

We show that the linearized program (5) is equivalent to an LMI optimization
problem admitting polynomial-time algorithms, say the well-known
interior-point methods supported by most off-the-shelf SDP solvers. Our
iterative procedure starts with a strictly feasible initial solution $z^0$ to program
(3) and terminates with $\lambda^k \ge 0$ (subject to numerical round-off) and
$a^k = -0.00363421$, yielding the barrier certificate

$$B(a^k, x) = -0.00363421\,x_2 \le 0.$$

Fig. 1. Phase portrait of the system in Example 1. The arrows indicate the
vector field and the solid curves are randomly sampled trajectories.

Figure 1 depicts the system dynamics
and the synthesized barrier certificate.
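
As a numerical sanity check (not a proof), one can simulate sampled trajectories of Example 1 and confirm that the reported certificate stays non-positive and that the unsafe set is never entered. The sketch below uses a hand-written fourth-order Runge-Kutta integrator; the vector field is the one read from Example 1, while the step size, horizon, sample count and all function names are assumptions chosen for illustration.

```python
import math
import random

A_COEFF = -0.00363421  # coefficient of the certificate as reported in the text

def vector_field(x1, x2):
    # right-hand side of the ODE in Example 1
    return x1 + x2, x1 * x2 - 0.5 * x2 * x2 + 0.1

def rk4_step(x1, x2, h):
    # one classical Runge-Kutta step of size h
    k1 = vector_field(x1, x2)
    k2 = vector_field(x1 + 0.5 * h * k1[0], x2 + 0.5 * h * k1[1])
    k3 = vector_field(x1 + 0.5 * h * k2[0], x2 + 0.5 * h * k2[1])
    k4 = vector_field(x1 + h * k3[0], x2 + h * k3[1])
    return (x1 + h / 6.0 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            x2 + h / 6.0 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))

def barrier(x1, x2):
    return A_COEFF * x2          # B(x) = a * x2

def unsafe(x1, x2):
    return x2 + 1.0 <= 0.0       # U(x) = x2 + 1 <= 0

def simulate(samples=50, horizon=1.5, h=1e-3, seed=0):
    rng = random.Random(seed)
    max_barrier, hit_unsafe = -math.inf, False
    for _ in range(samples):
        # sample an initial state in the disc x1^2 + (x2 - 2)^2 <= 1
        r, th = math.sqrt(rng.random()), rng.uniform(0.0, 2.0 * math.pi)
        x1, x2 = r * math.cos(th), 2.0 + r * math.sin(th)
        for _ in range(int(round(horizon / h))):
            max_barrier = max(max_barrier, barrier(x1, x2))
            hit_unsafe = hit_unsafe or unsafe(x1, x2)
            x1, x2 = rk4_step(x1, x2, h)
    return max_barrier, hit_unsafe

max_barrier, hit_unsafe = simulate()
```

The check relies on the observation that $\dot{x}_2 = 0.1 > 0$ whenever $x_2 = 0$, so trajectories starting with $x_2 > 0$ cannot cross the line $x_2 = 0$; the simulation only samples finitely many trajectories over a bounded horizon.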

We remark that the aforementioned 
iterative procedure on solving a series of convex optimizations converges only 
to a local optimum of the BMI problem (3). This means that, in some cases, it 
may miss the global optimum that induces a non-negative $\lambda^*$. We will present
in Sect. 6 a solution to this problem by incorporating our iterative procedure 
into a branch-and-bound framework that searches for the global optimum in a 
divide-and-conquer fashion. 


3 Mathematical Foundations 


Notations. Let $\mathbb{N}$, $\mathbb{N}^+$, $\mathbb{R}$, $\mathbb{R}^+$ and $\mathbb{R}^+_0$ be respectively the set of natural, positive
natural, real, positive real and non-negative real numbers. For a vector $x \in \mathbb{R}^n$,
$x_i$ refers to its $i$-th component and $\|x\|$ denotes the $\ell^2$-norm; for a matrix
$A \in \mathbb{R}^{n \times m}$, $A(i,j)$ refers to its $(i,j)$-th element. Let $\mathbb{R}[x]$ be the polynomial ring in
$x$ over the field $\mathbb{R}$. A polynomial $h \in \mathbb{R}[x]$ is sum-of-squares (SOS) iff there exist
polynomials $g_1,\ldots,g_k \in \mathbb{R}[x]$ such that $h = \sum_{i=1}^{k} g_i^2$. We denote by $\Sigma[x] \subset \mathbb{R}[x]$
the set of SOS polynomials over $x$. $\mathcal{S}^n$ denotes the space of $n \times n$ real, symmetric
matrices. For $A \in \mathcal{S}^n$, $A \succeq 0$ means that $A$ is positive semidefinite (psd, for
short)³, i.e., $\forall x \in \mathbb{R}^n: x^T A x \ge 0$. A matrix-valued function $\mathcal{B}: \mathbb{R}^n \to \mathcal{S}^m$ is
psd-convex on a convex set $C \subseteq \mathbb{R}^n$ if $\forall x_1, x_2 \in C.\ \forall \mu \in (0,1)$:
$\mathcal{B}(\mu x_1 + (1 - \mu)x_2) \preceq \mu\mathcal{B}(x_1) + (1 - \mu)\mathcal{B}(x_2)$.


Differential Dynamical Systems. We consider a class of continuous dynami- 
cal systems modelled by ordinary differential equations of the autonomous type: 


$$\dot{x} = f(x) \tag{6}$$


where $x \in \mathbb{R}^n$ is the state vector, $\dot{x}$ denotes its temporal derivative $dx/dt$, with
$t \in \mathbb{R}^+_0$ modelling time, and $f: \mathbb{R}^n \to \mathbb{R}^n$ is a polynomial flow field (or vector
field) that governs the evolution of the system. A polynomial vector field is locally
Lipschitz, and hence for some $T \in \mathbb{R}^+ \cup \{\infty\}$, there exists a unique solution (or
trajectory) $\zeta_{x_0}: [0,T) \to \mathbb{R}^n$ originating from any initial state $x_0 \in \mathbb{R}^n$ such that
(1) $\zeta_{x_0}(0) = x_0$, and (2) $\forall \tau \in [0,T): \frac{d\zeta_{x_0}}{dt}\big|_{t=\tau} = f(\zeta_{x_0}(\tau))$. We assume in the
sequel that $T$ is the maximal instant up to which $\zeta_{x_0}$ exists for all $x_0$.


Remark 1. Our techniques on synthesizing barrier certificates in this paper focus 
on differential dynamics of the form (6). However, we foresee no substantial 
difficulties in extending the results to multi-mode hybrid systems where extra 
constraints on the system evolution, e.g., guards, are present. 


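Safety Verification Problem. Given a domain set $\mathcal{X} \subseteq \mathbb{R}^n$, an initial set
$\mathcal{X}_0 \subseteq \mathcal{X}$ and an unsafe set $\mathcal{X}_u \subseteq \mathcal{X}$, the reachable set of a dynamical system of
the form (6) at time instant $t \in [0,T)$ is defined as $\mathcal{R}_{\mathcal{X}_0}(t) = \{\zeta_{x_0}(t) \mid x_0 \in \mathcal{X}_0\}$.
We denote by $\mathcal{R}_{\mathcal{X}_0}$ the aggregated reachable set, i.e., the union of $\mathcal{R}_{\mathcal{X}_0}(t)$ over
$t \in [0,T)$⁴. The system is said to be safe iff $\mathcal{R}_{\mathcal{X}_0} \cap \mathcal{X}_u = \emptyset$, and unsafe otherwise.
For simplicity, we consider $\mathcal{X} = \mathbb{R}^n$ throughout this paper.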

To avoid the explicit computation of the exact reachable set, which is usu- 
ally intractable for nonlinear hybrid systems (cf., e.g., [15]), barrier-certificate 
methods make use of a partial differential operator, termed the Lie derivative, 
to capture the evolution of a barrier function along the vector field: 


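Definition 1 (Lie Derivative [28]). Given a vector field $f: \mathbb{R}^n \to \mathbb{R}^n$ over
$x$, the Lie derivative of a polynomial function $B(x)$ along $f$, $\mathcal{L}_f^k B: \mathbb{R}^n \to \mathbb{R}$ of
order $k \in \mathbb{N}$, is

$$\mathcal{L}_f^k B(x) = \begin{cases} B(x), & k = 0,\\[2pt] \left\langle \dfrac{\partial}{\partial x}\mathcal{L}_f^{k-1}B(x),\ f(x) \right\rangle, & k > 0 \end{cases}$$

where $\langle\cdot,\cdot\rangle$ is the inner product of vectors, i.e., $\langle u, v\rangle = \sum_i u_i v_i$ for $u, v \in \mathbb{R}^n$.

³ More generally, for $A, B \in \mathcal{S}^n$, $A \preceq B$ indicates that $B - A$ is positive semidefinite.

⁴ This subsumes the problem of unbounded-time safety verification where a unique
solution exists over the infinite time horizon $[0, \infty)$.


Synthesizing Invariant Barrier Certificates via DCP 449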


The Lie derivative $\mathcal{L}_f^k B(x)$ is essentially the $k$-th temporal derivative of the
(barrier) function $B(x)$, and thus captures the change of $B(x)$ over time.
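
To make Definition 1 concrete, the following sketch computes Lie derivatives of a polynomial along a polynomial vector field in plain Python. Polynomials in $(x_1, x_2)$ are represented as dictionaries from exponent tuples to coefficients; this representation and all helper names are assumptions made for illustration. The vector field is the one read from Example 1 and the template is $B = a x_2$ with $a = 1$.

```python
from collections import defaultdict

# A polynomial in (x1, x2) is a dict {(e1, e2): coefficient}.

def add(p, q):
    r = defaultdict(float)
    for poly in (p, q):
        for mono, c in poly.items():
            r[mono] += c
    return {m: c for m, c in r.items() if abs(c) > 1e-12}

def mul(p, q):
    r = defaultdict(float)
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            r[tuple(e1 + e2 for e1, e2 in zip(m1, m2))] += c1 * c2
    return {m: c for m, c in r.items() if abs(c) > 1e-12}

def partial(p, i):
    # partial derivative with respect to the i-th variable
    r = defaultdict(float)
    for mono, c in p.items():
        if mono[i] > 0:
            lowered = list(mono)
            lowered[i] -= 1
            r[tuple(lowered)] += c * mono[i]
    return dict(r)

def lie_derivative(B, f, k):
    """k-th Lie derivative: L^0 B = B and L^k B = <grad L^{k-1} B, f>."""
    L = dict(B)
    for _ in range(k):
        nxt = {}
        for i, f_i in enumerate(f):
            nxt = add(nxt, mul(partial(L, i), f_i))
        L = nxt
    return L

# Vector field of Example 1: (x1 + x2, x1*x2 - 0.5*x2^2 + 0.1)
f = [
    {(1, 0): 1.0, (0, 1): 1.0},
    {(1, 1): 1.0, (0, 2): -0.5, (0, 0): 0.1},
]
B = {(0, 1): 1.0}                 # B = x2, i.e. the template a*x2 with a = 1
L1 = lie_derivative(B, f, 1)      # first-order Lie derivative of B
```

For this input, `L1` encodes $x_1x_2 - 0.5x_2^2 + 0.1$, which agrees with the first-order Lie derivative used in Example 1 up to the factor $a$.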

An inductive invariant $\Psi \subseteq \mathbb{R}^n$ of a dynamical system is a set of states such
that all the trajectories starting from within $\Psi$ remain in $\Psi$:


Definition 2 (Inductive Invariant [40]). Given a system (6), a set $\Psi \subseteq \mathbb{R}^n$
is an inductive invariant of system (6) if and only if


$$\forall x_0 \in \Psi.\ \forall t \in [0,T): \zeta_{x_0}(t) \in \Psi. \tag{7}$$


In the sequel, we refer to inductive invariants simply as invariants. In [36], a 
sufficient and necessary condition on being a polynomial invariant is proposed: 


Theorem 1 (Invariant condition [36]). Given a polynomial $B \in \mathbb{R}[x]$, its
zero sub-level set $\{x \mid B(x) \le 0\}$ is an invariant of system (6) if and only if⁵


$$B \le 0 \implies \bigvee_{i=0}^{N_{B,f}} \Big( \big( \bigwedge_{j=0}^{i-1} \mathcal{L}_f^j B = 0 \big) \wedge \mathcal{L}_f^i B < 0 \Big) \ \vee\ \bigwedge_{i=0}^{N_{B,f}} \mathcal{L}_f^i B = 0 \tag{8}$$


where $N_{B,f} \in \mathbb{N}^+$ is a completeness threshold, i.e., a finite positive integer that
bounds the order of Lie derivatives, which can be computed using Gröbner bases⁶.


In contrast, a barrier certificate is a function whose zero sub-level set isolates
an unsafe region $\mathcal{X}_u$ from the reachable set $\mathcal{R}_{\mathcal{X}_0}$ w.r.t. some initial set $\mathcal{X}_0$:


Definition 3 (Semantic Barrier Certificate [51]). Given a system (6), an
initial set $\mathcal{X}_0$ and an unsafe set $\mathcal{X}_u$, a barrier certificate of (6) is a differentiable
function $B: \mathbb{R}^n \to \mathbb{R}$ satisfying


$$\forall x_0 \in \mathcal{X}_0.\ \forall t \in [0,T): B(\zeta_{x_0}(t)) \le 0 \quad\text{and}\quad \forall x \in \mathcal{X}_u: B(x) > 0. \tag{9}$$


The existence of such a barrier certificate trivially implies safety of the system.
Moreover, one may readily verify that if some set $\Psi = \{x \mid B(x) \le 0\}$ is an
invariant and satisfies $(\mathcal{X}_0 \subseteq \Psi) \wedge (\Psi \cap \mathcal{X}_u = \emptyset)$, then $B(x)$ is a barrier certificate.

As observed in [51], however, the semantic statement in Definition 3 encodes 
merely the general principle of barrier certificates [8], yet in itself is not that 
useful for safety verification because it explicitly involves the system solutions. 
Therefore, in order to enable efficient synthesis, the semantic condition on barrier 
certificates has been strengthened into a handful of different shapes (see, e.g., [8, 
29,41,60], which all imply inductive invariance). Yet it has remained a long-standing
challenge to find a barrier-certificate condition that is as weak as possible while
admitting efficient synthesis algorithms.

Our BMI encoding of the invariant barrier-certificate condition (cf. Sect. 4)
is rooted in Putinar’s Positivstellensatz, which characterizes positivity of polynomials
on a semi-algebraic set defined by a system of polynomial inequalities:


⁵ In (8), $\bigwedge_{j=0}^{i-1} \mathcal{L}_f^j B = 0$ is true for $i = 0$ by default. This applies in the sequel.
⁶ $N_{B,f}$ is the minimal $i$ such that $\mathcal{L}_f^{i+1} B$ is in the polynomial ideal generated by
$\mathcal{L}_f^0 B, \mathcal{L}_f^1 B, \ldots, \mathcal{L}_f^i B$. The ideal membership can be decided via Gröbner basis.
Theorem 2 (Putinar’s Positivstellensatz [32]). Let $K = \{x \mid \bigwedge_{i=1}^{m} g_i(x) \ge 0\}$
be a compact semi-algebraic set defined by $g_1,\ldots,g_m \in \mathbb{R}[x]$. Assume the
Archimedean condition holds⁷, i.e., there exists $L \in \mathbb{R}^+$ such that
$L - \|x\|^2 = \sigma_0(x) + \sum_{i=1}^{m} \sigma_i(x)g_i(x)$ for some $\sigma_0,\ldots,\sigma_m \in \Sigma[x]$. If $h \in \mathbb{R}[x]$ is strictly
positive on $K$, then


$$h(x) = \sigma_0(x) + \sum_{i=1}^{m} \sigma_i(x)g_i(x)$$

holds for some SOS polynomials $\sigma_0,\ldots,\sigma_m \in \Sigma[x]$.


We now recall a key technique used in our reduction to semidefinite optimizations.
Given a symmetric matrix $X \in \mathcal{S}^n$ partitioned as $X = \begin{pmatrix} A & C\\ C^T & D \end{pmatrix}$ with
invertible $A$, the Schur complement of $A$ in $X$ is defined as $X/A = D - C^T A^{-1} C$.
An important property of the Schur complement $X/A$ is that it characterizes
the positive semidefiniteness of the block matrix $X$:


Theorem 3 (Schur Complement [3]). If $A \succ 0$, then $X \succeq 0$ iff $X/A \succeq 0$.


We apply the Schur complement in Sect. 5 to transform nonlinear convex
constraints into linear constraints.
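
Theorem 3 can be checked exhaustively in its smallest instance, where $A$, $C$ and $D$ are scalars and $X$ is a $2 \times 2$ matrix. The sketch below uses exact rational arithmetic from the Python standard library to avoid rounding effects; the function names and the sampling ranges are assumptions made for illustration.

```python
from fractions import Fraction
import random

def is_psd_2x2(p, q, r):
    # A symmetric matrix [[p, q], [q, r]] is psd iff its diagonal entries
    # and its determinant are all non-negative.
    return p >= 0 and r >= 0 and p * r - q * q >= 0

def schur_complement(a, c, d):
    # X/A = D - C^T A^{-1} C for scalar blocks A = a, C = c, D = d
    return d - c * (1 / a) * c

rng = random.Random(1)
mismatches = 0
for _ in range(5000):
    a = Fraction(rng.randint(1, 30), 10)      # A > 0 as required by Theorem 3
    c = Fraction(rng.randint(-30, 30), 10)
    d = Fraction(rng.randint(-30, 30), 10)
    if is_psd_2x2(a, c, d) != (schur_complement(a, c, d) >= 0):
        mismatches += 1
```

In the scalar case the equivalence is simply $ad - c^2 \ge 0 \iff d - c^2/a \ge 0$ for $a > 0$, so `mismatches` stays at zero.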


4 Invariant Barrier-Certificate Condition as BMIs 


In this section, we present our invariant barrier-certificate condition (see Defi- 
nition 4) based on the necessary and sufficient condition on being an inductive 
invariant (cf. Theorem 1), and show how to encode it as BMI constraints. 


4.1 Invariant Barrier-Certificate Condition 


Definition 4 (Invariant Barrier Certificate). Given a system (6), an initial
set $\mathcal{X}_0$ and an unsafe set $\mathcal{X}_u$, a polynomial function $B: \mathbb{R}^n \to \mathbb{R}$ is an
invariant barrier certificate of system (6) if and only if


1. (initial): $\forall x \in \mathcal{X}_0: B(x) \le 0$;
2. (consecution): $\forall x \in \mathbb{R}^n: \bigwedge_{i=1}^{N_{B,f}} \Big( \big( \bigwedge_{j=0}^{i-1} \mathcal{L}_f^j B(x) = 0 \big) \implies \mathcal{L}_f^i B(x) \le 0 \Big)$;
3. (separation): $\forall x \in \mathcal{X}_u: B(x) > 0$.


Notice that the consecution constraint in Definition 4 involves Lie derivatives
of orders up to $N_{B,f} \in \mathbb{N}^+$, as is the case in Theorem 1. Our invariant
barrier-certificate condition hence generalizes existing conditions on barrier certificates,
e.g., [4,60,63], which consider Lie derivatives only up to the first order.

The consecution condition in Definition 4 is in fact equivalent to the invariant 
condition (8) in Theorem 1 (cf. [57, Lemma 2]), thereby revealing the relation 
between an inductive invariant and an invariant barrier certificate: 


⁷ This condition can be met by adding a (redundant) constraint $g_{m+1}(x) = L_0 - \|x\|^2 \ge 0$,
provided that a bound $L_0 \in \mathbb{R}^+$ is known such that $\forall x \in K: L_0 - \|x\|^2 \ge 0$.
Theorem 4 (Inductive Invariance). Given a system (6), an initial set $\mathcal{X}_0$
and an unsafe set $\mathcal{X}_u$, if $B(x)$ is an invariant barrier certificate, then
$\Psi = \{x \mid B(x) \le 0\}$ is an invariant. Conversely, if $\Psi = \{x \mid B(x) \le 0\}$ is an invariant
satisfying $\mathcal{X}_0 \subseteq \Psi$ and $\Psi \cap \mathcal{X}_u = \emptyset$, then $B(x)$ is an invariant barrier certificate.


It follows from Theorem 4 that our invariant barrier-certificate condition is 
the least conservative one on barrier certificates to attain inductive invariance. 


Remark 2. We do not employ the invariant condition (8) in Theorem 1 as the 
constraint on the consecution of Lie derivatives. This is because our consecution 
condition in Definition 4 is simpler, and in particular, amenable to more straight- 
forward transformations to SOS constraints via Putinar’s Positivstellensatz, as 
shown later in Subsect. 4.2. 


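Remark 3. For a fixed $0 < N < N_{B,f}$, the consecution condition in Definition 4
can be strengthened in the following way while preserving inductive invariance:


$$\forall x \in \mathbb{R}^n: \bigwedge_{i=1}^{N-1} \Big( \big( \bigwedge_{j=0}^{i-1} \mathcal{L}_f^j B(x) = 0 \big) \implies \mathcal{L}_f^i B(x) \le 0 \Big) \ \wedge\ \Big( \big( \bigwedge_{j=0}^{N-1} \mathcal{L}_f^j B(x) = 0 \big) \implies \mathcal{L}_f^N B(x) < 0 \Big)$$


where for the $N$-th Lie derivative, one needs $\mathcal{L}_f^N B(x) < 0$ (rather than $\mathcal{L}_f^N B(x) \le 0$).
In practice, using such a strengthened consecution condition, with fewer
sub-constraints to solve, may yield more efficient synthesis.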


4.2 Encoding as BMI Optimizations 


Next, we show how to encode synthesizing an invariant barrier certificate (cf.
Definition 4) as an optimization problem subject to BMIs. To this end, we first recast
the invariant barrier-certificate condition into a collection of SOS constraints⁸.


Theorem 5 (Sufficient Condition for Invariant Barrier Certificate).
Given a system (6), an initial set $\mathcal{X}_0 = \{x \mid \mathcal{I}(x) \le 0\}$ and an unsafe set
$\mathcal{X}_u = \{x \mid \mathcal{U}(x) \le 0\}$, a polynomial $B \in \mathbb{R}[x]$ is an invariant barrier certificate
of (6) if for some $\epsilon \in \mathbb{R}^+$, there exist $v_{i,j} \in \mathbb{R}[x]$ and SOS polynomials $\sigma(x), \sigma'(x)$
s.t.


1. $-B(x) + \sigma(x)\mathcal{I}(x)$,
2. for all $1 \le i \le N_{B,f}$, $-\mathcal{L}_f^i B(x) + \sum_{j=0}^{i-1} v_{i,j}(x)\mathcal{L}_f^j B(x)$,
3. $B(x) + \sigma'(x)\mathcal{U}(x) - \epsilon$

are SOS polynomials.


By enforcing the Archimedean condition and applying Putinar’s Positivstellensatz,
we further derive a necessary condition for invariant barrier certificates:


⁸ For simplicity, we assume that $\mathcal{X}_0$ and $\mathcal{X}_u$ are both captured by a single polynomial.
Our formulations, however, apply also to cases with basic semi-algebraic $\mathcal{X}_0$ or $\mathcal{X}_u$.
Theorem 6 (Necessary Condition for Invariant Barrier Certificate).
Given a system (6), an initial set $\mathcal{X}_0 = \{x \mid \mathcal{I}(x) \le 0\}$ and an unsafe
set $\mathcal{X}_u = \{x \mid \mathcal{U}(x) \le 0\}$, if $B \in \mathbb{R}[x]$ is an invariant barrier certificate
of (6), then for some $\epsilon \in \mathbb{R}^+$, there exist $v_{i,j} \in \mathbb{R}[x]$ and SOS polynomials
$\sigma(x), \sigma'(x), \rho(x), \rho'(x), \rho''_i(x)$ s.t. for any $L \in \mathbb{R}^+$,


1. $-B(x) + \rho(x)(\|x\|^2 - L) + \sigma(x)\mathcal{I}(x) + \epsilon$,
2. for all $1 \le i \le N_{B,f}$, $-\mathcal{L}_f^i B(x) + \rho''_i(x)(\|x\|^2 - L) + \sum_{j=0}^{i-1} v_{i,j}(x)\mathcal{L}_f^j B(x) + \epsilon$,
3. $B(x) + \rho'(x)(\|x\|^2 - L) + \sigma'(x)\mathcal{U}(x)$


are SOS polynomials.


Notice that a polynomial $B(x)$ satisfying the sufficient condition in Theorem 5
suffices as an invariant barrier certificate that witnesses safety of the
system. In contrast, a polynomial $B(x)$ satisfying the necessary condition in
Theorem 6 may serve as a candidate invariant barrier certificate, and safety of
the system can be concluded via a posterior check⁹ of $B(x)$ per Definition 4.

Next we show how to encode an SOS constraint of the shape “$h(x) \in \Sigma[x]$”
in Theorems 5 and 6 as a BMI constraint. To this end, we first set a template
polynomial¹⁰ $B(a,x)$ parameterized by unknown real coefficients $a$ as the barrier
certificate. We then proceed by setting templates for the remaining unknown
polynomials (e.g., $v_{i,j}(x)$) and SOS polynomials (e.g., $\sigma(x)$ and $\rho(x)$) in $h(x)$,
with all the parameters in these templates grouped into $s$. Observe that the
parameterized SOS polynomial $h(a,s,x)$ is of a bilinear form on the parameter
spaces, i.e., $h(a,s,x)$ is linear in $a$ and $s$ separately. However, nonlinearity arises
in the combined parameter space $(a,s)$ due to the product couplings of $a$ and $s$,
i.e., $v_{i,j}(s_{i,j},x)\mathcal{L}_f^j B(a,x)$ in the consecution constraint.

Now the problem of synthesizing an invariant barrier certificate boils down to 
searching for an instantiation of the parameters a and s such that the sufficient 
condition in Theorem 5 holds (or alternatively, the necessary condition in
Theorem 6 holds and the posterior check passes). Such an instantiation of a (making
B(a,x) an invariant barrier certificate) will be called valid in the sequel. 

Suppose that a parameterized SOS polynomial $h(a,s,x)$ is of degree at most
$2d$, with user-specified $d \in \mathbb{N}$. Then $h(a,s,x)$ can always be written in quadratic
form as $h(a,s,x) = b^T Q(a,s)\, b$, where $b = (1, x_1, x_2, x_1x_2, \ldots, x_n^d)$ is the basis
vector of size $p = \binom{n+d}{d}$ containing all monomials of degree up to $d$, and
$Q(a,s) \in \mathcal{S}^p$ is a parameterized real symmetric matrix known as the Gram matrix [6]¹¹.
An important fact states that $h(a,s,x)$ is SOS if and only if $Q(a,s) \succeq 0$.

Let $\mathcal{F}(a,s) = -Q(a,s)$. As per $h(a,s,x)$, the matrix-valued function $\mathcal{F}(a,s)$
is bilinear in $(a,s)$. Observe that $h(a,s,x)$ is SOS if and only if the BMI
constraint $\mathcal{F}(a,s) \preceq 0$ holds. See Example 1 for an illustration of this BMI encoding.
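
The Gram-matrix encoding can be checked numerically on constraint (1.2) of Example 1. The sketch below compares the polynomial $h(a,s,x) = -a(x_1x_2 - 0.5x_2^2 + 0.1) + (s_0 + s_1x_1 + s_2x_2)\,a x_2$ with the quadratic form $b^T Q(a,s)\, b$ for the basis $b = (1, x_1, x_2)$, where $Q$ is the matrix inside the BMI constraint of Example 1 (before negation). The function names, the sampling ranges and the random seed are assumptions made for illustration.

```python
import random

def gram_matrix(a, s0, s1, s2):
    # Gram matrix Q(a, s) of constraint (1.2) w.r.t. the basis b = (1, x1, x2)
    return [
        [-0.1 * a, 0.0, 0.5 * a * s0],
        [0.0, 0.0, 0.5 * (a * s1 - a)],
        [0.5 * a * s0, 0.5 * (a * s1 - a), a * s2 + 0.5 * a],
    ]

def h(a, s0, s1, s2, x1, x2):
    # constraint (1.2): -L^1 B + v(x) * L^0 B with B = a * x2
    v = s0 + s1 * x1 + s2 * x2
    return -a * (x1 * x2 - 0.5 * x2 ** 2 + 0.1) + v * a * x2

def quad_form(Q, b):
    return sum(b[i] * Q[i][j] * b[j] for i in range(3) for j in range(3))

rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    a, s0, s1, s2, x1, x2 = (rng.uniform(-2.0, 2.0) for _ in range(6))
    b = (1.0, x1, x2)
    gap = abs(h(a, s0, s1, s2, x1, x2) - quad_form(gram_matrix(a, s0, s1, s2), b))
    max_gap = max(max_gap, gap)

# Every entry of Q is linear in a for fixed s (and vice versa): doubling a doubles Q.
Q1 = gram_matrix(1.5, 0.3, -0.7, 0.2)
Q2 = gram_matrix(3.0, 0.3, -0.7, 0.2)
```

Every entry of `gram_matrix` is a sum of terms that are constant, linear in $a$, or products $a s_j$, which is exactly the bilinear structure that makes $-Q(a,s) \preceq 0$ a BMI rather than an LMI.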


⁹ Such a check inherits decidability of the first-order theory of real-closed fields [53].

¹⁰ A template polynomial $g(a,x)$ is required to be linear in its parameters $a$.

¹¹ Extracting the Gram matrix amounts to solving a system of linear equations resulting
from coefficient matching. The derived Gram matrix may contain extra unknowns if
the system of linear equations admits multiple solutions, which nevertheless can be
encoded in our subsequent workflow by enumerating the basis of its null space.
In general, $\mathcal{F}(a,s)$ can be flattened in an expanded bilinear form as


$$\mathcal{F}(a,s) = F + \sum_{i=1}^{m} a_i H_i + \sum_{j=1}^{n} s_j G_j + \sum_{i=1}^{m}\sum_{j=1}^{n} a_i s_j F_{ij}$$


where $m$ and $n$ are the size of $a$ and $s$, respectively; $F, H_i, G_j, F_{ij} \in \mathcal{S}^p$ are
constant matrices. Discharging the conditions of invariant barrier certificates
hence amounts to solving the BMI feasibility problem of finding $a$ and $s$ s.t.


$$\mathcal{F}_\iota(a,s) \preceq 0, \quad \iota = 1,2,\ldots,l. \tag{10}$$


Here $\mathcal{F}(a,s)$ is indexed by $\iota$ and $l$ is the number of SOS constraints involved.
To exploit well-developed techniques in optimization, the feasibility problem
(10) is transformed to an optimization problem subject to BMI constraints:


$$\begin{aligned}
&\text{maximize} && \lambda\\
&\text{subject to} && \mathcal{F}_\iota(a,s) + \lambda I \preceq 0, \quad \iota = 1,2,\ldots,l.
\end{aligned} \tag{11}$$


A solution $(\lambda,a,s)$ to (11) is feasible if it satisfies the BMIs in (11), and strictly
feasible if all the BMIs are satisfied with strict inequalities. We sometimes drop
the $\lambda$ component in the solution when it is clear from the context. Notice that
problem (10) has a feasible solution if and only if the optimal value $\lambda^*$ in the
BMI optimization problem (11) is non-negative.

To achieve (weak) completeness of our method in subsequent sections on 
solving the BMI optimization problem, we make the following assumption on 
the boundedness of the search space (a,s) of the optimization. 


Assumption 1 (Boundedness on the Parameters). Every feasible solution
$(a,s)$ to the BMI problem (11) is in a compact set with non-empty interior, i.e.,


$$(a,s) \in C_a \times C_s = \{(a,s) \mid \|a\|^2 \le L_a,\ \|s\|^2 \le L_s\}$$


for some known bounds $L_a, L_s \in \mathbb{R}^+$.


Remark 4. The boundedness on $a$ in Assumption 1 makes sense in practice since
we usually prefer barrier certificates with bounded coefficients. Moreover, when
the bilinear functions $\mathcal{F}_\iota(a,s)$ in (11) are affine in $a$ and $s$, i.e., with a zero
constant matrix $F$, the parameters $a$ and $s$ can be scaled independently by any
positive factor. Therefore in this case, w.l.o.g., one may simply set $L_a = L_s = 1$.


5 Solving BMI Optimizations via DCP 


The BMI optimization problem (11), derived from the synthesis problem, is 
known to be NP-hard and contains non-convex constraints [55], and hence is not 
amenable to efficient (polynomial-time) algorithms committed to solving convex 
optimizations. In this section, we present an algorithm for solving general BMI 
optimizations via difference-of-convex programming [33,52], which solves a series
of convex sub-problems that approaches a local optimum of (11).

For brevity, we consider optimization problems with a single BMI
constraint¹²:


$$\begin{aligned}
&\underset{z=(x,y)}{\text{maximize}} && g(z)\\
&\text{subject to} && \mathcal{B}(x,y) = F + \sum_{i=1}^{m} x_i H_i + \sum_{j=1}^{n} y_j G_j + \sum_{i=1}^{m}\sum_{j=1}^{n} x_i y_j F_{ij} \preceq 0
\end{aligned} \tag{12}$$

where the objective function $g: \mathbb{R}^{m+n} \to \mathbb{R}$ is linear in $z = (x,y)$;
$F, H_i, G_j, F_{ij} \in \mathcal{S}^p$ are constant symmetric matrices.


5.1 Difference-of-Convex Decomposition 


The key challenge in solving the BMI problem (12) is its non-convexity, that is, 
the matrix-valued function B(x,y) is, in general, not psd-convex. 

There have been attempts, most pertinently in [10], to decompose a bilinear
function as a difference between two psd-convex functions, known as the
difference-of-convex (DC) decomposition, such that the optimization in its
decomposed form enjoys well-established techniques in difference-of-convex
programming [33,52]. The DC decomposition in [10], however, is confined to BMIs
of a specific structure, namely, $X^TY + Y^TX \preceq 0$, where $X$ and $Y$ are matrix
variables containing variables $x_i$ and $y_j$, respectively. The more general bilinear
function $\mathcal{B}(x,y)$ in (12) unfortunately does not admit straightforward forms of
decomposition such as those in [10, Lemma 3.1].

In what follows, we present a difference-of-convex decomposition of the
matrix-valued function $\mathcal{B}(x,y)$, inspired by [58], using eigendecomposition.

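First, observe that the function $\mathcal{B}(x,y)$ can be written as


$$\mathcal{B}(x,y) = \begin{pmatrix} x \otimes I\\ y \otimes I \end{pmatrix}^T \begin{pmatrix} 0 & \Gamma\\ \Gamma^T & 0 \end{pmatrix} \begin{pmatrix} x \otimes I\\ y \otimes I \end{pmatrix} + \begin{pmatrix} N_1 & N_2 \end{pmatrix} \begin{pmatrix} x \otimes I\\ y \otimes I \end{pmatrix} + F \tag{13}$$

where $\otimes$ denotes the Kronecker product: for two matrices $A \in \mathbb{R}^{a \times b}$ and
$B \in \mathbb{R}^{c \times d}$, $A \otimes B \triangleq [A(1,1)B, \ldots, A(1,b)B;\ \cdots;\ A(a,1)B, \ldots, A(a,b)B] \in \mathbb{R}^{ac \times bd}$, $0$
represents the zero matrices with compatible dimensions, and


$$\Gamma = \frac{1}{2}\begin{pmatrix} F_{11} & \cdots & F_{1n}\\ \vdots & \ddots & \vdots\\ F_{m1} & \cdots & F_{mn} \end{pmatrix}, \quad N_1 = (H_1\ \ldots\ H_m), \quad N_2 = (G_1\ \ldots\ G_n).$$

The form of (13) implies that $\mathcal{B}(x,y)$ is psd-convex if the matrix
$M = \begin{pmatrix} 0 & \Gamma\\ \Gamma^T & 0 \end{pmatrix}$ is positive semidefinite. Unfortunately, as [58, Theorem 1] points out,
for a non-trivial bilinear function $\mathcal{B}(x,y)$, $M$ may not be positive semidefinite.


¹² Multiple BMI constraints can be joined as a single BMI in a block-diagonal fashion.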
Nevertheless, the matrix $M$ can always be decomposed as $M = M_1 - M_2$
with $M_1, M_2 \succeq 0$, i.e., a difference between two psd matrices. One way to do so
is to use the eigendecomposition of the (real symmetric¹³) matrix $M \in \mathcal{S}^{(m+n)p}$.
That is, $M = V^T D V$, where the orthogonal matrix $V$ contains the eigenvectors
of $M$; $D$ is a diagonal matrix whose diagonal elements are the eigenvalues of $M$.

Let $D^+$ be the matrix obtained by setting all negative elements of $D$ to zero
and $D^- = D^+ - D$. We have


$$M = \underbrace{V^T D^+ V}_{M_1} - \underbrace{V^T D^- V}_{M_2}.$$

It follows that $M_1, M_2 \succeq 0$ and therefore we find a DC decomposition of $\mathcal{B}(x,y)$:
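Theorem 7 (Difference-of-Convex Decomposition). The following form

$$B(x, y) = B^+(x,y) - B^-(x,y) \qquad (14)$$

where

$$B^+(x,y) = \begin{pmatrix} x \otimes I \\ y \otimes I \end{pmatrix}^{\mathsf{T}} M_1 \begin{pmatrix} x \otimes I \\ y \otimes I \end{pmatrix} + \begin{pmatrix} N_1 & N_2 \end{pmatrix} \begin{pmatrix} x \otimes I \\ y \otimes I \end{pmatrix} + F, \qquad B^-(x,y) = \begin{pmatrix} x \otimes I \\ y \otimes I \end{pmatrix}^{\mathsf{T}} M_2 \begin{pmatrix} x \otimes I \\ y \otimes I \end{pmatrix}$$

is a difference-of-convex decomposition of $B(x,y)$. Namely, the matrix-valued
functions $B^+(x,y)$ and $B^-(x,y)$ are psd-convex on $\mathbb{R}^{m+n}$.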
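As a small worked instance of the eigendecomposition-based split, consider the scalar case. This is a sketch under the assumptions $p = 1$, $m = n = 1$, a single bilinear term $2\gamma xy$ with $\gamma > 0$, and all other coefficients zero; the symbol $\gamma$ is introduced here only for illustration. The eigenvalues of $M$ are $\pm\gamma$ with eigenvectors $(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$, and the construction recovers the familiar identity $4xy = (x+y)^2 - (x-y)^2$:

```latex
M = \begin{pmatrix} 0 & \gamma \\ \gamma & 0 \end{pmatrix}
  = \underbrace{\frac{\gamma}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}}_{M_1 \succeq 0}
  - \underbrace{\frac{\gamma}{2}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}}_{M_2 \succeq 0},
\qquad
2\gamma\, xy
  = \underbrace{\frac{\gamma}{2}(x+y)^2}_{B^+(x,y)}
  - \underbrace{\frac{\gamma}{2}(x-y)^2}_{B^-(x,y)} .
```

Both squared terms are convex in $(x, y)$, whereas the product $xy$ itself is not.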


Remark 5. In practice, the aforementioned matrices $M$, $M_1$ and $M_2$ induced by
eigendecomposition are often highly sparse. One can hence exploit the sparsity
to improve the algorithmic performance of the DCP-based synthesis approach.


5.2 Reduction to LMIs 


On top of the DC decomposition (cf. Theorem 7), we can now apply a standard 
iterative procedure in difference-of-convex programming [10] to solve the BMIs. 

The core idea of the procedure is to iteratively solve a series of convex
sub-problems. More specifically, given a feasible solution $z^k = (x^k, y^k)$ to the BMI
optimization problem (12), the "concave part" $-B^-(x,y)$ in (14) is linearized
around $z^k$, thereby yielding a series of convex programs ($k = 0, 1, \ldots$):

$$\begin{aligned} &\underset{z}{\text{maximize}} && g(z) + \tfrac{\delta}{2} \| z - z^k \|^2 \\ &\text{subject to} && B^+(z) - B^-(z^k) - DB^-(z^k)(z - z^k) \preceq 0 \end{aligned} \qquad (15)$$

where $DB^-(z)\colon \mathbb{R}^{m+n} \to \mathbb{S}^p$ is the derivative of the matrix-valued function $B^-$
at $z$, i.e., a linear mapping from a vector $u \in \mathbb{R}^{m+n}$ to a matrix in $\mathbb{S}^p$:

$$DB^-(z)(u) = \sum_{i=1}^{m+n} u_i \frac{\partial B^-}{\partial z_i}(z).$$

13 M thus only has real eigenvalues.
Algorithm 1: BMI-DC: Solving BMIs based on DC decomposition

input: A BMI optimization problem (12) with a strictly feasible initial solution $z^0$.
output: A sequence of feasible solutions $S = \{z^0, \ldots, z^k\}$ to the BMI optimization.

$k \leftarrow 0$; $S \leftarrow \{z^0\}$;
$M \leftarrow$ reformulation of (12) as (13);
$(M_1, M_2) \leftarrow$ DC decomposition of $M$ as in (14);
repeat
    Construct the convex sub-problem (15) out of $(M_1, M_2)$ linearized around $z^k$;
    $z^{k+1} \leftarrow$ optimum of the program (15);
    $S \leftarrow S \cup \{z^{k+1}\}$;    ▷ S keeps track of visited points
    $k \leftarrow k + 1$;
until $\| z^k - z^{k-1} \| < \epsilon$ for a given tolerance $\epsilon \in \mathbb{R}^+$;
return $S$;


An extra regularization term $\frac{\delta}{2} \| z - z^k \|^2$ with $\delta < 0$ is added in (15) to
enforce that $g(z)$ strictly increases after each iteration until it stabilizes; this
term can be encoded as a second-order cone constraint and embedded in SDP solving.

Note that the linearized problem (15) is convex and therefore can be solved
efficiently¹⁴ via methods including, among others, augmented Lagrangian
methods [35] and gradient descent methods [3]. Furthermore, the Schur complement
in Theorem 3 implies that (15) can be reformulated as an LMI problem:


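Theorem 8. The quadratic matrix inequality (QMI) constraint

$$B^+(z) - B^-(z^k) - DB^-(z^k)(z - z^k) \preceq 0$$

in (15) is equivalent to the LMI constraint¹⁵

$$\begin{pmatrix} -I & N (z \otimes I) \\ (z \otimes I)^{\mathsf{T}} N^{\mathsf{T}} & -B^-(z^k) - DB^-(z^k)(z - z^k) + Q (z \otimes I) + F \end{pmatrix} \preceq 0$$

where $N$ is the square root matrix of $M_1$, i.e., $M_1 = N^{\mathsf{T}} N$, and $Q = \begin{pmatrix} N_1 & N_2 \end{pmatrix}$.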
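The equivalence can be traced through the Schur complement in two lines. The sketch below assumes the standard fact that a symmetric block matrix with top-left block $-I$ is negative semidefinite exactly when its Schur complement is, and it reads $B^+(z)$ as the quadratic form in $M_1 = N^{\mathsf{T}} N$ plus the linear term $Q(z \otimes I)$ and the constant $F$. The shorthands $A$ and $C$ are introduced here only for illustration.

```latex
\begin{pmatrix} -I & A \\ A^{\mathsf{T}} & C \end{pmatrix} \preceq 0
\iff C + A^{\mathsf{T}} A \preceq 0,
\qquad A = N (z \otimes I), \quad
C = -B^-(z^k) - DB^-(z^k)(z - z^k) + Q (z \otimes I) + F,

C + A^{\mathsf{T}} A
  = \underbrace{(z \otimes I)^{\mathsf{T}} N^{\mathsf{T}} N (z \otimes I) + Q (z \otimes I) + F}_{B^+(z)}
  - B^-(z^k) - DB^-(z^k)(z - z^k).
```

The quadratic dependence on $z$ is thus moved into the off-diagonal blocks, where it appears only linearly.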


Theorem 8 entails that the series of linearized convex sub-problems of the
form (15) can be solved alternatively by most off-the-shelf SDP solvers
designated for discharging LMIs via polynomial-time algorithms, say the
interior-point methods. Furthermore, by taking the optimum of the k-th sub-problem to
be the next linearization point $z^{k+1}$, we obtain an iterative procedure for solving
general BMIs, as depicted in Algorithm 1.

Algorithm 1 falls into the DCP framework [10] and thus enjoys useful prop- 
erties, e.g., soundness, termination and convergence as follows. 


14 The global optimum of (15) is attainable under standard assumptions, e.g., Slater’s 
condition and the second-order sufficient KKT conditions [3]. 
15 This transforms a QMI with matrices in $\mathbb{S}^p$ to an LMI with matrices in $\mathbb{S}^{(m+n)p+p}$.

Theorem 9 (Soundness). Every solution $z^i = (x^i, y^i) \in S$ with $i = 0, \ldots, k$
returned by Algorithm 1 is a feasible solution to the original BMI problem (12).
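A brief sketch of why every iterate stays feasible, assuming the standard first-order characterization of psd-convexity (a psd-convex function lies above its linearization in the Loewner order):

```latex
B^-(z) \succeq B^-(z^k) + DB^-(z^k)(z - z^k)
\quad\Longrightarrow\quad
B(z) = B^+(z) - B^-(z)
     \preceq B^+(z) - B^-(z^k) - DB^-(z^k)(z - z^k) \preceq 0 .
```

Any $z$ that satisfies the linearized constraint therefore also satisfies the original BMI constraint, so each convex sub-problem is an inner approximation of the non-convex feasible set.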


The result below states termination and convergence of Algorithm 1 in terms
of KKT points of (12), i.e., solutions fulfilling the KKT conditions [3] of (12)¹⁶.


Theorem 10 (Termination and convergence). If (12) has finitely many
KKT points, then (1) for $\epsilon \in \mathbb{R}^+$, Algorithm 1 terminates; (2) for $\epsilon = 0$,
Algorithm 1 visits an infinite sequence of solutions converging to a KKT point.


We remark that, under some sufficient KKT conditions and regularity
conditions [3], a KKT point suffices as a local optimum. In this case, the infinite
sequence $\{z^i\}_{i \in \mathbb{N}}$ of points visited by Algorithm 1 (for $\epsilon = 0$) converges to a
local optimum of (12).
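The local nature of this convergence can be seen on a one-dimensional toy problem. The sketch below is an illustrative stand-in rather than a BMI: the functions, the starting point and the parameter values are assumptions, and the convex sub-problem is solved in closed form instead of by an SDP solver. It maximizes $g(z) = -z$ subject to $f(z) = (z^2 + 1) - 2z^2 \le 0$, whose feasible set $\{z : |z| \ge 1\}$ is non-convex.

```python
import math

# Toy one-dimensional DC program (illustrative stand-in, not a BMI):
#   maximize g(z) = -z  subject to  f(z) = f_plus(z) - f_minus(z) <= 0,
# with f_plus(z) = z**2 + 1 and f_minus(z) = 2*z**2, i.e. f(z) = 1 - z**2.

def f(z):
    return (z * z + 1.0) - 2.0 * z * z

def solve_subproblem(zk, delta):
    """Maximize -z + (delta/2)*(z - zk)**2 over the convexified feasible set.

    Linearizing f_minus around zk turns the constraint into
        z**2 - 4*zk*z + (1 + 2*zk**2) <= 0,
    an interval [lo, hi] centered at 2*zk. The concave objective is maximized
    at zk + 1/delta (delta < 0), clipped to that interval.
    """
    radius = math.sqrt(2.0 * zk * zk - 1.0)
    lo, hi = 2.0 * zk - radius, 2.0 * zk + radius
    return min(max(zk + 1.0 / delta, lo), hi)

def dc_iterate(z0, delta=-0.1, eps=1e-9, max_iter=100):
    """Return the list of visited points, mimicking the repeat-until loop."""
    visited = [z0]
    for _ in range(max_iter):
        visited.append(solve_subproblem(visited[-1], delta))
        if abs(visited[-1] - visited[-2]) < eps:
            break
    return visited

points = dc_iterate(3.0)
print(points[-1])
```

Starting from the feasible point $z^0 = 3$, every iterate remains feasible and the sequence settles at the boundary point $z = 1$, a local optimum. The feasible region $z \le -1$, where $g$ takes larger values, is never reached, which is the kind of situation a global search strategy has to address.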


5.3 Finding the Initial Solution 


The iterative procedure in Algorithm 1 starts with a fed-by-oracle strictly feasible 
initial solution $z^0$ to the BMI problem (12). Finding such an initial solution,
however, is non-trivial in general due to the non-convexity of (12). We argue 
though, that a strictly feasible initial solution can be obtained for the BMI 
problem of the form (11) induced by the barrier-certificate synthesis problem. 

Recall that in the BMI problem (11), bilinearity arises from the multiplication
of $B(a,x)$ with some unknown multiplier polynomials parameterized by $s$. One
way to reduce the BMI constraints to LMIs is to fix every multiplier polynomial
to be a non-negative constant, thereby yielding a linear program:

$$\begin{aligned} &\underset{\lambda,\, a}{\text{maximize}} && \lambda \\ &\text{subject to} && F_\iota(a, s)\big|_{s = (c_\iota, 0, \ldots, 0)} + \lambda I \preceq 0, \quad \iota = 1, 2, \ldots, l \end{aligned} \qquad (16)$$

where $s$ in $F_\iota(a,s)$ is substituted by $(c_\iota, 0, \ldots, 0)$ with $c_\iota \in \mathbb{R}_0^+$, which encodes
a non-negative constant multiplier polynomial. Observe that no s-variable is
involved in (16) and the constraints therein are linear in $a$.

Apparently, a strictly feasible solution $(\lambda, a)$ to (16) induces a strictly feasible
solution $(\lambda, a, (c_\iota, 0, \ldots, 0))$ to (11) as well. Moreover, we have


Lemma 1. The LMI program (16) always has a strictly feasible solution. 


As a consequence, a strictly feasible solution to the BMI problem (11) can 
be obtained by solving the LMI problem (16). In fact, when considering Lie 
derivatives only up to the first order, solving (the feasibility counterpart of) (16) 
is exactly the procedure to synthesize either an exponential barrier certificate [29]
(with $c_\iota \in \mathbb{R}^+$) or a convex barrier certificate [41] (with $c_\iota = 0$). Algorithm 1
therefore subsumes existing synthesis techniques in the sense that any valid 
barrier certificate synthesized by methods in [29,41] can also be discovered by 
Algorithm 1. Moreover, an alternative way to reduce the BMI constraints to 
LMIs is to fix the multipliers to be some given non-trivial (SOS) polynomials [62]. 


16 Addressing the KKT conditions in detail falls outside the scope of this paper. 
Algorithm 2: Branch-and-Bound: Searching for a valid parameter a

input: A BMI optimization problem of the form (11) with $C_a = \{a \mid \|a\|^2 \le L_a\}$.
output: A valid parameter $\bar{a}$, or otherwise $\bot$ indicating a failure.

 1  if $L_a < \eta$ then return $\bot$;    ▷ abort on fine-enough partitions ($\eta \in \mathbb{R}^+$)
    /* sample-and-check is not necessary if Theorem 6 is used */
 2  $\bar{a} \leftarrow$ a randomly-sampled point in $C_a$;
 3  if $\bar{a}$ is valid then return $\bar{a}$;    ▷ check validity (inductive invariance)
 4  if $\mathrm{proj}_a(S_{glb}) \cap C_a = \emptyset$ then    ▷ $S_{glb}$ contains a global set of visited points
 5      $S \leftarrow$ apply BMI-DC in Algorithm 1 to (11) with initial solution in $(C_a, C_s)$;
 6      $S_{glb} \leftarrow S_{glb} \cup S$;
        /* checking validity is not necessary if Theorem 5 is used */
 7      if a valid parameter $\bar{a} \in \mathrm{proj}_a(S)$ is found then return $\bar{a}$;
 8  $(C_1, C_2) \leftarrow \mathrm{bisect}(C_a)$;    ▷ partition the parameter space
 9  $\bar{a} \leftarrow$ Branch-and-Bound($C_1$);
10  if $\bar{a} \ne \bot$ then return $\bar{a}$;
11  else return Branch-and-Bound($C_2$);


Remark 6. Different choices of the multiplier constants $c_\iota$ in (16) may lead to
different initial solutions fed to Algorithm 1, and thereby to considerably different
numbers of iterations until termination. In practice, techniques like randomization are
worth exploring when choosing these multiplier constants.


6 Incorporating in a Branch-and-Bound Framework 


The aforementioned iterative procedure on solving a series of convex optimiza- 
tions converges only to a local optimum of the BMI problem (11) (or more 
generally, (12)). This means that, in some cases, it may miss the global optimum
that induces a non-negative $\lambda^*$. We present in this section a solution to
this problem by incorporating the iterative procedure into a branch-and-bound 
framework that searches for the global optimum in a divide-and-conquer fashion, 
as is a common technique in non-convex optimizations. 

The basic idea is as follows. We first try to solve the BMI problem (11) by
Algorithm 1 over the compact parameter space $(C_a, C_s)$. If a valid solution (i.e., a
solution that contains a valid parameter $\bar{a} \in C_a$ such that $B(\bar{a}, x)$ is an invariant
barrier certificate) is found, then the corresponding barrier certificate can be
obtained. Otherwise, we keep bisecting $C_a$ and apply Algorithm 1 over each
bisection¹⁷. The procedure, as depicted in Algorithm 2 in a recursive manner,
terminates when a valid parameter is found or the partition is fine enough.

Algorithm 2 takes as input a BMI problem of the form (11) that encodes 
either the sufficient condition in Theorem 5 or the necessary condition in The- 
orem 6 for invariant barrier certificates. In the former case, a sample-and-check 
process (Line 2-3) is necessary to attain (weak) completeness (see Theorem 11). 
The conditional statement in Line 4 rules out parameter (sub-)spaces that have
already been explored, which is the case when the projection of some visited
point in $S_{glb}$ (a global set that keeps track of visited points by Algorithm 1,
initialized as $\emptyset$) onto $a$ is in the current parameter space.

17 The validity of $\bar{a} \in C_a$ does not depend on $s$, thus we do not partition $C_s$.
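The recursion can be sketched in a few lines of Python. Everything below is a hypothetical stand-in: an interval replaces the parameter set $C_a$, `is_valid` plays the role of the inductive-invariance check, and `local_search` stands for the points visited by BMI-DC projected onto the parameter space. None of these names come from the paper or its implementation.

```python
import random

def branch_and_bound(box, is_valid, local_search, eta, visited, rng):
    """Recursive search for a valid parameter in the interval box = (lo, hi).

    Returns a valid parameter, or None to indicate failure.
    """
    lo, hi = box
    if hi - lo < eta:                      # abort on fine-enough partitions
        return None
    a = rng.uniform(lo, hi)                # sample-and-check
    if is_valid(a):
        return a
    if not any(lo <= v <= hi for v in visited):   # box not explored yet
        found = local_search(box)          # stand-in for BMI-DC
        visited.extend(found)              # global record of visited points
        for v in found:
            if is_valid(v):
                return v
    mid = 0.5 * (lo + hi)                  # bisect the parameter space
    result = branch_and_bound((lo, mid), is_valid, local_search,
                              eta, visited, rng)
    if result is not None:
        return result
    return branch_and_bound((mid, hi), is_valid, local_search,
                            eta, visited, rng)

# Toy instance: exactly the parameters in (2.4, 2.6) are "valid".
def is_valid(a):
    return 2.4 < a < 2.6

def local_search(box):
    # Hypothetical local method that visits a single point of the box.
    return [box[0] + 0.25 * (box[1] - box[0])]

result = branch_and_bound((-4.0, 4.0), is_valid, local_search,
                          0.01, [], random.Random(0))
print(result)
```

In this toy instance the valid set has non-empty interior, so bisection eventually produces a sub-interval lying entirely inside it and the sample-and-check step succeeds there. This mirrors, in a very simplified form, the role of the sample-and-check step when the partition is fine enough.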

The following theorem claims a weak completeness result: our method is
guaranteed to find a barrier certificate when there exists an inductive invariant (in
the form of a given template) that suffices to certify safety of the system.


Theorem 11 (Weak Completeness). Algorithm 2 returns a valid parameter
$\bar{a} \in C_a$, if (1) the partition granularity is fine enough (i.e., small enough $\eta \in \mathbb{R}^+$),
(2) the degrees of multiplier polynomials and SOS polynomials used to form (11)
are large enough, and (3) there exists, for the given template $B(a,x)$, a strictly
valid parameter $\hat{a} \in C_a$ (i.e., any parameter in some neighborhood of $\hat{a}$ is valid).


Remark 7. The bisection operation in Algorithm 2 induces —in the worst case— 
an exponential blow-up in the number of branches. In practice, one can prune 
branches inducing only negative objective values, via, e.g., convex relaxation [26]. 


7 Experimental Results 


We have carried out a prototypical implementation¹⁸ of our synthesis techniques
in Wolfram MATHEMATICA, which was selected due to its built-in primitives for 
SDP, polynomial algebra and matrix operations. Given a safety verification prob- 
lem as input, our implementation works toward discovering an invariant barrier 
certificate (in the form of a given template) that witnesses unbounded-time safety 
of the system. A collection of benchmark examples (detailed in [57, Appendix B]) 
has been evaluated on a 2.10 GHz Intel Xeon processor with 376GB RAM run- 
ning 64-bit CentOS Linux 7. 

Table 1 reports the empirical results. BMI-DC concerns our locally-convergent 
Algorithm 1 for solving BMIs (encoding the sufficient condition in Theorem 5) 
based on DC decomposition. We compare our approach with PENLAB [14]—an 
off-the-shelf solver in MATLAB for directly discharging the same BMI problems 
(with no guarantee on convergence)—and SOSTOOLS [39]—for solving LMIs 
derived from Prajna and Jadbabaie’s original barrier-certificate condition [41]. 
The comparison is performed under the same problem configurations¹⁹. Due
to numerical errors caused by floating-point computations and the fact that 
reaching the local/global optimum does not necessarily yield a valid barrier 
certificate, we additionally perform a posterior check, via both the quantifier- 
elimination procedure in MATHEMATICA and the SMT solver Z3 [37], of the 
synthesized candidate barrier certificate per Definition 4. 

Table 1 shows that BMI-DC suffices to synthesize valid barrier certificates in 
most of the examples within a reasonable number of iterations (i.e., the number of 
convex sub-problems solved by SDP). This however does not cover all the cases: 


18 Available at https://github.com/Chenms404/BMI-DC.
19 For PENLAB and SOSTOOLS, we use their optimized, built-in criteria for termina- 
tion and methods for finding the initial solutions. 
Table 1. Empirical results on benchmark examples (time in seconds)

Example name            n_sys  d_flow  d_BC   BMI-DC                     PENLAB             SOSTOOLS
                                              #iter.  Time    Verified   Time     Verified  Time   Verified
overview [11]           2      2       1      2       0.03    ✓          0.31     ✓         0.07   ✓
contrived               2      1       2      0       0.01    ✓          0.48     ✓         0.75   ✓
lie-der [36]            2      2       1      0       0.01    ✓          0.22     ✓         0.04   ✓
lorenz [11]             3      2       2      8       2.37    ✓          1I       ✗         1.47   ✗
lti-stable [19]         2      1       2      0       0.01    ✓          0.23     ✓         0.14   ✓
lotka-volterra [21]     3      2       1      3       0.07    ✓          0.36     ✓         0.21   ✓
clock [43]              2      3       1      0       0.01    ✓          0.88     ✗         0.18   ✗
lyapunov [44]           3      3       2      4       1.25    ✓          56.98    ✗         0.35   ✓
arch1 [50]              2      5       2      0       0.01    ✓          33.76    ✗         0.31   ✓
arch2 [50]              2      2       2      5       0.37    ✓          0.38     ✗         0.17   ✗
arch3 [50]              2      3       2      1       0.07    ✓          0.54     ✓         0.18   ✓
arch4 [50]              2      2       1      2       0.09    ✓          0.49     ✗         0.06   ✓
barr-cert1 [41]         2      3       2      12      0.85    ✓          2.53     ✗         0.09   ✗
barr-cert2 [11]         2      2       2      6       1.57    ✓          1.16     ✗         0.15   ✓
barr-cert3 [63]         2      2       1      0       0.01    ✓          0.20     ✓         0.11   ✗
barr-cert4 [63]         2      3       2      13      0.96    ✓          0.89     ✗         0.23   ✗
fitzhugh-nagumo [47]    2      3       2      2       0.16    ✓          1.24     ✓         0.25   ✗
stabilization [48]      3      2       2      9       2.88    ✓          55.22    ✓         0.11   ✓
lie-high-order          2      1       2      32      4.12    ✓          1.56     ✗         0.25   ✗
raychaudhuri [13]       4      2       2      34      9.51    ✓          33.64    ✗         0.14   ✗
focus [42]              2      1       4      100     54.89   ✗          0.95     ✗         0.48   ✗
sys-bio1 [27]           7      2       2      2       73.22   ?          101.95   ?         1.35   ?
sys-bio2 [27]           9      2       1      1       1.03    ?          15.54    ?         0.16   ?
quadcopter [19]         12     1       1      0       0.03    ?          65.42    ?         0.36   ?

n_sys: system dimension; d_flow: maximal flow-field degree; d_BC: degree of the template barrier certificate.
#iter.: number of iterations. 0 means that the initial solution (cf. Subsect. 5.3) is valid.
verified: the synthesized barrier certificate is valid (✓), invalid (✗) or inconclusive (?, beyond the
capability of quantifier elimination in MATHEMATICA and nonlinear reasoning in Z3).
time: CPU-time, excluding that for casting the BMIs/LMIs. Boldface marks the winner among ✓'s.


for the focus example, the solution is close enough to a local optimum (after
100 iterations) but still yields an invalid barrier certificate. This problem can be
solved (if there exists an invariant barrier certificate as specified) by enforcing
the branch-and-bound framework as presented in Sect. 6. The phase portraits of
a selected set of examples and the synthesized invariant barrier certificates are
depicted in Fig. 2 (see more in [57, Appendix B]).

The comparison in Table 1 suggests that (1) our invariant barrier-certificate
condition recognizes more barrier certificates than the original (more
conservative) condition as implemented in SOSTOOLS. In particular, the lie-high-order
example does admit an inductive invariant in the form of the given template,
but none of the existing barrier-certificate conditions [4,60,63], which concern Lie
derivatives only up to the first order, recognizes it, since we have $\mathcal{L}_f^1 B(x) = 0$
for some $x$ on the boundary of $B$ and hence one has to exploit the second-order
Lie derivative $\mathcal{L}_f^2 B$; (2) our DCP-based synthesis algorithm finds more barrier
certificates in less time than directly solving the BMI problems via non-convex
optimization techniques as implemented in PENLAB.

Fig. 2. Phase portraits of a selected set of examples with the synthesized invariant
barrier certificates. The arrows indicate the vector field (hidden in 3D-graphics for a
clear presentation) and the solid curves are randomly sampled trajectories.

We remark that symbolic methods based on, e.g., quantifier elimination [36], 
can hardly deal with any of the examples listed in Table 1 due to the prohibitively
high computational complexity. Moreover, it would be desirable to pursue a
comparison with the augmented Lagrangian method for solving BMIs as proposed
in [4], which unfortunately is not yet possible due to the unavailability of the 
implementation thereof. We will discuss crucial differences to [4] in Sect. 8. 


8 Related Work 


As surveyed in [15], the research community has, over the past three decades, 
extensively addressed the automatic verification of safety-critical hybrid systems. 
The almost universal undecidability of the unbounded-time reachability prob- 
lem [1], however, confines the sound key-press routines to either semi-decision 
procedures or approximation schemes, most of which address bounded-time ver- 
ification by, e.g., computing the finite-time image of a set of initial states. 

Invariant generation [36,41], amongst others, is a well-established approxima- 
tion scheme that provides a reliable witness for safety (or equivalently, unreach- 
ability) of dynamical systems over the infinite time horizon. Invariants can be 
constructed in various forms, e.g., barrier certificates [41,51] and differential 
invariants [36,40]. With a priori specified templates, the invariant synthesis 
problem can be reduced to numerical optimizations or constraint solving, as 
in, e.g., [22,25, 46,54]. 

Most pertinently, Prajna and Jadbabaie proposed in their seminal work [41] 
a concept coined barrier certificate to encode invariants. To enable efficient 
synthesis via semidefinite programming, the barrier-certificate condition in [41] 
strengthens the general condition encoding inductive invariance. Since then,
significant efforts have been invested in developing more relaxed (i.e., weaker)
forms of barrier-certificate condition that still admit efficient synthesis, thereby
leading to, e.g., exponential-type barrier certificates [29], Darboux-type barrier
certificates [62], general barrier certificates [8] and vector barrier certificates [51].
To attain efficient synthesis, these barrier-certificate conditions share a
common property on convexity. That is, if for some $a_1, a_2 \in \mathbb{R}^m$, $B(a_1, x)$ and
$B(a_2, x)$ both satisfy the barrier-certificate condition, then for any $0 \le \mu \le 1$,
$B(\mu a_1 + (1 - \mu) a_2, x)$ must also satisfy the barrier-certificate condition.

However, neither the semantic barrier-certificate condition (9) encoding the 
general principle of barrier certificates [8,51] nor the inductive invariant con- 
dition (8) is convex. This means, when resorting to convex barrier-certificate 
conditions, one may miss some potential barrier certificates that suffice as induc- 
tive invariants witnessing safety. Therefore, non-convex conditions were sug- 
gested [60], for which the synthesis problem can be reduced to BMI problems 
solvable via customized schemes, e.g., the augmented Lagrangian method [4] 
and the alternating minimization algorithm [63]. Our synthesis techniques also 
exploit a BMI reduction, with three crucial differences: (1) our invariant barrier- 
certificate condition is equivalent to the inductive invariant condition in the sense 
of Theorem 4, and thus is less conservative than all the aforementioned condi- 
tions which consider Lie derivatives only up to the first order; (2) our DCP-based 
techniques for solving BMIs naturally inherit appealing results on convergence 
and (weak) completeness, which are not (and can hardly be) provided by the 
approaches in [4,60,63]; (3) our DCP-based iterative procedure visits only fea- 
sible solutions to the original BMI problem, and hence whenever a solution that 
induces a non-negative objective value is found, we can safely terminate the algo- 
rithm and claim a feasible solution to the original BMI problem, which may yield 
a valid barrier certificate. This is not the case for the approaches in [4, 60,63]. 

Beyond barrier certificates, Wang and Rajamani [58] investigated the feasi- 
bility problem of general BMI problems with an application to multi-objective 
nonlinear observer design. The technique of eigendecomposition was also used 
therein to conduct the DC decomposition. The decomposed concave part, how- 
ever, is simply ignored and no iterative procedure that exhibits convergence to 
a local optimum can be provided. 

The idea of augmenting a locally-convergent algorithm with a branch-and- 
bound framework to find the global optimum has been exploited in the realm 
of optimization [20] and control [56]. In contrast, our method is designed for 
the specific problem of barrier-certificate synthesis, and hence our branch-and- 
bound algorithm concerns only the parameter space of a, i.e., coefficients of the 
template barrier certificate. 

Finally, we refer interested readers to other approaches to solving BMI prob- 
lems, e.g., rank minimization [23,38,45], sequential SDP [7,12], as well as meth- 
ods committed to general non-convex optimizations, e.g., interior point trust- 
region [5,9,34], successive linearization [24] and primal-dual interior point [59]. 
9 Conclusion


Barrier certificates are powerful tools to prove time-unbounded safety of hybrid 
systems. We have presented a new condition on barrier certificates—the invariant 
barrier-certificate condition. This condition is by far the least conservative one on 
barrier certificates, and can be shown as the weakest possible one to attain induc- 
tive invariance. We showed that our invariant barrier-certificate condition can be 
reformulated as an optimization problem subject to bilinear matrix inequalities, 
which can be solved by our locally-convergent algorithm based on difference-of- 
convex programming. By incorporating this algorithm into a branch-and-bound 
framework, we obtained a weak completeness result. Experiments on benchmark 
examples suggested that our invariant barrier-certificate condition recognizes 
more barrier certificates than existing conditions, and that our DCP-based algo- 
rithm is more efficient than directly solving the BMIs via off-the-shelf solvers. 
We stress that our techniques for solving BMIs are of a general nature rather 
than being confined to barrier-certificate synthesis. Interesting future directions
include extending our method to other synthesis problems, e.g., discovering
invariants and/or termination proofs of deterministic/probabilistic programs.


Acknowledgements. The authors would like to thank Hengjun Zhao for the fruitful 
discussion on differential dynamics requiring high-order Lie derivatives. 
Abstract. In this paper, we propose a safe reinforcement learning approach
to synthesize deep neural network (DNN) controllers for nonlinear
systems subject to safety constraints. The proposed approach employs an 
iterative scheme where a learner and a verifier interact to synthesize safe 
DNN controllers. The learner trains a DNN controller via deep reinforce- 
ment learning, and the verifier certifies the learned controller through 
computing a maximal safe initial region and its corresponding barrier 
certificate, based on polynomial abstraction and bilinear matrix inequal- 
ities solving. Compared with the existing verification-in-the-loop synthe- 
sis methods, our iterative framework is a sequential synthesis scheme of 
controllers and barrier certificates, which can learn safe controllers with 
adaptive barrier certificates rather than user-defined ones. We implement 
the tool SRLBC and evaluate its performance over a set of benchmark 
examples. The experimental results demonstrate that our approach effi- 
ciently synthesizes safe DNN controllers even for a nonlinear system with 
dimension up to 12. 
1 Introduction


The design and synthesis of controllers for dynamical systems is a fundamental 
problem in the field of control. In recent years, with the boom of deep learning, 
there has been considerable research activity in the use of deep neural
networks (DNNs) for control of cyber-physical systems such as unmanned aerial
vehicles, self-driving cars, etc. [33]. For these safety-critical systems, one of the 
most important and challenging problems is safe controller synthesis, that is, 
to synthesize a controller guaranteeing that the system’s trajectory will never 
intersect with an undesired region. 

A number of techniques included under the umbrella of Deep Reinforcement
Learning (DRL) have been used to effectively learn controllers from user-defined
reward functions encoding desired system behavior [17,36]. A majority of these
works lack formal reasoning about the safety of the DNN-controlled dynamical
systems resulting from such a learning process. To guarantee the safety of
synthesized DNN controllers, a considerable body of work focuses on the safety
verification of DNN-controlled closed-loop systems, which is a hard problem because
it is entangled with highly nonlinear DNN expressions. The main line of research on
this topic is reachable set estimation of DNN-controlled systems, which can
only deal with time-bounded safety properties [11,12,18,19,37]. On the other
hand, rather than formally verifying already synthesized DNN controllers, more
recent works learn DNN controllers for dynamical systems with
safety guarantees [8,39,40]. For example, a verification-in-the-loop DNN
controller training algorithm is presented in [8], which integrates an RL framework
with user-provided control barrier functions (CBFs) for reward function
encoding, combined with SMT-based formal CBF checking; a correctness-by-design
method is proposed in [39] that first learns DNN controllers and barrier
certificates simultaneously using supervised learning, and then performs posterior
formal verification of barrier certificates via SMT solvers.

In this paper, we propose a safe reinforcement learning approach to synthesize
DNN controllers for nonlinear systems subject to safety constraints via barrier
certificate generation. The proposed approach employs an iterative scheme where
a learner and a verifier interact to synthesize safe DNN controllers. First, the
learner applies a DRL method to train a DNN controller by encoding the safety
requirement (and the barrier certificate requirement, if applicable) into the reward
function. For the learned controller, the verifier computes a Maximal Safe Initial
Region (MSIR) and the corresponding barrier certificate. Once the MSIR is a
superset of the prescribed initial set $\Theta$, the safety of the closed-loop system
under the learned controller with $\Theta$ is verified. Otherwise, the computed barrier
certificate is adjusted and fed back to guide the learner to retrain a new
controller. This inductive loop repeats until an MSIR enclosing $\Theta$ is computed.
In [8], a user-provided barrier certificate is adopted for reward
function encoding and the barrier certificate is fixed throughout the learning process,
whereas in this paper the controllers and the barrier certificates are synthesized
simultaneously and yielded in a larger state space, which increases the
diversity and flexibility of barrier certificates. Meanwhile, the barrier certificates in
our approach are computed by a numerical optimization method, which is more
efficient than the SMT-based method in [8]. Compared with [39], our method
is based on an RL framework and thus has better data sampling efficiency than
the meshing-based data set generation in [39] for supervised learning. Besides,
our method is iterative and can therefore utilize intermediate learned results to guide
learning in the next iteration, rather than restarting from scratch as in [39] when
a learned barrier certificate fails formal checking. Thanks to these advantages,
our method performs well in efficiency and scalability, even for
problems with dimension up to 12.

The main contributions of this paper are summarized as follows: 


- We propose a safe reinforcement learning approach via barrier certificate
generation to synthesize DNN controllers, which can guarantee the unbounded-time
safety of the closed-loop systems.

- Our synthesis approach employs a sequential iterative scheme, where DNN
controllers and the corresponding barrier certificates are synthesized alternately,
and in each iteration, barrier certificates are slightly adjusted to guide
the retraining of safe DNN controllers quickly.

- We provide a detailed experimental evaluation on a set of benchmarks, which
shows the efficiency and effectiveness of our approach.


The paper is organized as follows. Section 2 gives a brief introduction to
the safe controller synthesis problem. Section 3 describes an iterative scheme
of safe reinforcement learning for safe DNN controller synthesis. In Sect. 4, we
provide an overall algorithm with a detailed example attached to depict how
the algorithm works. In Sect. 5, we present an experimental evaluation of our
algorithm over a set of benchmark examples. We compare with related works in
Sect. 6 before concluding in Sect. 7.


2 Preliminaries 


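Notations. Let $\mathbb{R}$ and $\mathbb{N}$ denote the field of real numbers and the set of natural
numbers, respectively. $\mathbb{R}[x]$ denotes the ring of polynomials with coefficients in $\mathbb{R}$
over variables $x = [x_1, x_2, \ldots, x_n]^{\mathsf{T}}$, and $\mathbb{R}[x]^n$ denotes the set of n-dimensional
vectors of polynomials. Let $\mathbb{R}[x]_d \subset \mathbb{R}[x]$ be the vector space of polynomials of degree at most $d$.
Let $\mathbb{N}^n_d := \{\alpha \in \mathbb{N}^n : \sum_i \alpha_i \le d\}$. Denote by $\Sigma[x] \subset \mathbb{R}[x]$ (resp. $\Sigma[x]_d \subset \mathbb{R}[x]_{2d}$)
the space of sums of squares (SOS) polynomials.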

Consider a continuous dynamical system of the form

$$\dot{x} = f(x), \qquad (1)$$

where $x = (x_1, \ldots, x_n)^{\mathsf{T}} \in \mathbb{R}^n$ and $f = (f_1, \ldots, f_n)^{\mathsf{T}} \in \mathbb{R}[x]^n$ is the vector field
defined on the state space $D \subset \mathbb{R}^n$. We assume that $f$ satisfies the local Lipschitz
condition, so that (1) has a unique solution $x(t, x_0)$ in $D$ for every initial state
$x_0 \in D$ at time $t = 0$.

In many contexts, a dynamical system is equipped with a domain $\Psi \subset D$ and
an initial set $\Theta \subset \Psi$, represented as a triple $\mathcal{C} = (f, \Psi, \Theta)$. Given a prespecified
unsafe region $X_u \subset D$, we say that the system $\mathcal{C}$ is safe if no system trajectory
starting from $\Theta$ can evolve into a state in $X_u$; this property has been
widely investigated in safety-critical applications.


Definition 1 (Safety). For a constrained continuous dynamical system
(CCDS) $\mathcal{C} = (f, \Psi, \Theta)$ and a given unsafe region $X_u$, the system is safe if for all
$x_0 \in \Theta$, there does not exist $t_1 > 0$ such that

$$\forall t \in [0, t_1].\ x(t, x_0) \in \Psi \quad \text{and} \quad x(t_1, x_0) \in X_u,$$

that is, the system's trajectory never reaches $X_u$ from $\Theta$ as long as it remains
in $\Psi$.


Remark 1. If the trajectory $x(t, x_0)$ first leaves $\Psi$ and then enters $\Psi$ again, then
by Definition 1, the part of the trajectory from the first exit point is excluded 
from our concern and is not relevant to the safety of the considered CCDS. 


In this paper, we consider a controlled CCDS $\mathcal{C} = (f, \Psi, \Theta)$ with continuous
dynamics defined by

$$\begin{cases} \dot{x} = f(x, u) \\ u = k(x), \end{cases} \qquad (2)$$

where $x \in \Psi \subset \mathbb{R}^n$ are the system states, $u \in U \subset \mathbb{R}^m$ are the control inputs,
and $f\colon \Psi \times U \to \mathbb{R}^n$ and $k\colon \Psi \to U$ are the locally Lipschitz continuous vector
field and feedback controller function, respectively. The problem we consider
in this paper is defined as follows.


Definition 2 (Safe Controller Synthesis). For a controlled CCDS $\mathcal{C} =
(\mathbf{f}, \Psi, \Theta)$ with $\mathbf{f}$ defined by (2) and a given unsafe region $X_u$, design a locally
Lipschitz continuous feedback control law $k$ such that the closed-loop system $\mathcal{C}$
with $\mathbf{f} = \mathbf{f}(\mathbf{x}, k(\mathbf{x}))$ is safe as per Definition 1.


The concept of barrier certificates plays an important role in safety verifica- 
tion of continuous systems. The essential idea is to use the zero level set of a 
barrier certificate B(x) as a barrier to separate all the reachable states from the 
unsafe region. The following theorem states the conditions that must be satisfied 
by a barrier certificate. 


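Theorem 1 [26]. Given a continuous system $\mathcal{C} = (\mathbf{f}, \Psi, \Theta)$ and the unsafe
region $X_u$, suppose there exists a real-valued function $B : \Psi \to \mathbb{R}$ satisfying the
following conditions:

(i) $B(\mathbf{x}) \ge 0 \quad \forall \mathbf{x} \in \Theta$,
(ii) $B(\mathbf{x}) < 0 \quad \forall \mathbf{x} \in X_u$,
(iii) $B(\mathbf{x}) = 0 \implies L_f B(\mathbf{x}) > 0 \quad \forall \mathbf{x} \in \Psi$,

where $L_f B(\mathbf{x})$ denotes the Lie derivative of $B(\mathbf{x})$ along the vector field $\mathbf{f}(\mathbf{x})$, i.e.,
$L_f B(\mathbf{x}) = \sum_{i=1}^{n} \frac{\partial B}{\partial x_i} \cdot f_i(\mathbf{x})$. Then $B(\mathbf{x})$ is a barrier certificate, and the safety of
system $\mathcal{C}$ is guaranteed.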
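
As an illustration of condition (iii), the sketch below evaluates the Lie derivative numerically on the zero level set of a candidate barrier. The vector field, the candidate $B(\mathbf{x}) = 1 - x_1^2 - x_2^2$, and its hand-written gradient are toy assumptions chosen for this example only; they are not taken from the paper.

```python
import math


def f(x):
    """Toy vector field (assumption): a linear contraction toward the origin."""
    return (-x[0], -x[1])


def B(x):
    """Toy barrier candidate (assumption): zero level set is the unit circle."""
    return 1.0 - x[0] ** 2 - x[1] ** 2


def grad_B(x):
    """Gradient of the toy barrier candidate, written out by hand."""
    return (-2.0 * x[0], -2.0 * x[1])


def lie_derivative(grad, field, x):
    """L_f B(x) = sum_i dB/dx_i(x) * f_i(x)."""
    return sum(g * fi for g, fi in zip(grad(x), field(x)))


# Sample the zero level set B(x) = 0 and evaluate the Lie derivative there.
level_set = [(math.cos(2 * math.pi * k / 64), math.sin(2 * math.pi * k / 64))
             for k in range(64)]
lie_values = [lie_derivative(grad_B, f, p) for p in level_set]
```

For this toy pair, $L_f B(\mathbf{x}) = 2(x_1^2 + x_2^2)$, so every sampled value on the unit circle is positive and condition (iii) holds at the samples.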


Corollary 1. For a controlled CCDS $\mathcal{C} = (\mathbf{f}, \Psi, \Theta)$ with $\mathbf{f}$ defined by (2), a
feedback control law $\mathbf{u} = k(\mathbf{x})$ ensures the safety of $\mathcal{C}$ if
there exists a barrier certificate for the closed-loop system under the control law
$k(\mathbf{x})$.


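Throughout this paper, we assume that the initial set $\Theta$, the domain $\Psi$
and the unsafe set $X_u$ are compact semi-algebraic sets, defined by polynomial
equations and inequalities. Concretely, the semi-algebraic sets $\Theta$, $\Psi$ and $X_u$ are
represented as follows:

$$\begin{aligned}
\Theta &:= \{\mathbf{x} \in \mathbb{R}^n \mid g_i(\mathbf{x}) \ge 0,\ i = 1, \ldots, m_1\}, \\
\Psi &:= \{\mathbf{x} \in \mathbb{R}^n \mid h_j(\mathbf{x}) \ge 0,\ j = 1, \ldots, m_2\}, \\
X_u &:= \{\mathbf{x} \in \mathbb{R}^n \mid q_k(\mathbf{x}) \ge 0,\ k = 1, \ldots, m_3\},
\end{aligned}$$

for some polynomials $g_i, h_j, q_k \in \mathbb{R}[\mathbf{x}]$.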


3 Synthesis of Safe Controller via Learning and 
Verification 


In this section, we introduce an iterative framework for synthesizing a deep neural 
network (DNN) controller for a CCDS subject to safety constraints. As shown in 
Fig. 1, the procedure is structured as an inductive loop between a learner and a 
verifier. The learner trains a DNN controller using reinforcement learning. The 
trained DNN controller is passed to the verifier, which checks the safety of the 
closed-loop system under the trained controller via barrier certificate generation. 

As Fig. 1 shows, we first apply reinforcement learning to train a
neural network controller $\mathbf{u} = k(\mathbf{x})$ with safety as the training objective,
and then try to generate a barrier certificate $B(\mathbf{x})$ by solving bilinear matrix
inequalities (BMI), which guarantees the safety of the closed-loop system under
the controller $k(\mathbf{x})$.

However, for the system under the controller $k(\mathbf{x})$, such a barrier certificate $B(\mathbf{x})$
may not exist. The reasons are twofold: (i) the controller $k(\mathbf{x})$ is trained on
trajectories starting from only finitely many points in the initial set $\Theta$; (ii) the existence
of a barrier certificate is only a sufficient condition for the safety of the given
system.

Fig. 1. The framework of safe neural network controller synthesis.

In this situation, for the learned controller $k(\mathbf{x})$, one may compute a Maximal
Safe Initial Region (MSIR) $\Theta_\gamma$ and the corresponding barrier certificate $B(\mathbf{x})$,
which guarantees the safety of the continuous system with respect to the
initial set $\Theta_\gamma$. If $\Theta_\gamma$ is a superset of the prescribed initial set $\Theta$, i.e., $\Theta \subseteq \Theta_\gamma$,
then the safety of the system with respect to $\Theta$ is verified. Otherwise,
we need to adjust the barrier certificate $B(\mathbf{x})$ and the controller $k(\mathbf{x})$ in turn.
This yields an iterative framework, wherein each iteration
proceeds in two stages:


– Update the neural network controller. We apply deep reinforcement
learning to obtain the updated controller $k_i(\mathbf{x})$ by feeding in $B_{i-1}(\mathbf{x})$,
the barrier certificate produced in the previous iteration (see the
learner in Fig. 1).

– Compute the barrier certificate with the maximal safe initial region.
With the updated controller $k_i(\mathbf{x})$, we reduce barrier certificate
generation to solving bilinear matrix inequalities (BMI), and then
compute the maximal region $\Theta_i$ with the corresponding barrier certificate
$B_i(\mathbf{x})$. That is, the existence of $B_i(\mathbf{x})$ suffices to prove the safety of the
system with respect to the initial set $\Theta_i$. If $\Theta_i$ encloses the original initial set
$\Theta$, i.e., $\Theta \subseteq \Theta_i$, the current controller $k_i(\mathbf{x})$ is the desired safe one. Otherwise,
we refine $B_i(\mathbf{x})$ and go to the next iteration (see the verifier in
Fig. 1).


3.1 Training of Safe Controller 


In the following, we focus on the learner component of Fig. 1 and show how 
to train a safe controller using deep deterministic policy gradient (DDPG) [23], 
which is a popular reinforcement learning approach suited for continuous control 
applications. DDPG combines value-based and policy-based methods,
and is made up of two parts: actor and critic. The critic uses off-policy data
to learn the action-value function, which evaluates how good the action taken
is in the given state $\mathbf{x}$. The actor learns the continuous action policy by
using the action-value function. In practice, it is difficult to obtain the exact
action-value function and policy function. Thus, two deep neural networks are
introduced to solve this problem, i.e., the critic network $Q(\mathbf{x}, \mathbf{u} \mid \theta^Q)$ and the actor
network $k(\mathbf{x} \mid \theta^k)$ with weights $\theta^Q$ and $\theta^k$, respectively.

The reward function should be designed appropriately to achieve safe
controller synthesis via reinforcement learning. The task is to synthesize a DNN
controller such that no trajectory of the closed-loop system starting from $\Theta$
can evolve into the unsafe region $X_u$.
Thus, the reward function is preliminarily defined as

$$\hat{r}_t = \beta_1 \cdot \mathrm{dist}(X_u, \mathbf{x}_t),$$

where $\beta_1 > 0$ is the scale factor, and $\mathrm{dist}(X_u, \mathbf{x}_t)$ denotes the distance between
the state $\mathbf{x}_t$ and the unsafe region $X_u$. In addition, according to the third condition
of Theorem 1, once the trajectory hits the zero level set of the barrier certificate,
it must satisfy $L_f B(\mathbf{x}_t) > 0$; otherwise, the system behavior should be penalized.
For this purpose, the reward function is updated as

$$r_t = \begin{cases} \hat{r}_t - \min(\beta_2 |L_f B(\mathbf{x}_t)|, \Delta r_{\min}), & |B(\mathbf{x}_t)| \le \delta \text{ and } L_f B(\mathbf{x}_t) \le 0, \\ \hat{r}_t, & \text{otherwise}, \end{cases} \tag{3}$$

where $L_f B(\mathbf{x}_t) = \sum_{i=1}^{n} \frac{\partial B}{\partial x_i}(\mathbf{x}_t)\, f_i(\mathbf{x}_t, \mathbf{u}_t)$, $\beta_2 > 0$ is the scale factor, $\delta$ is a small
positive value characterizing a neighborhood of the zero level set of $B$, and $\Delta r_{\min} > 0$ is a
threshold that avoids too large fluctuations of the reward value. In this work, we set $\beta_1 = 1.0$,
$\beta_2 = 1.0$, $\delta = 0.1$, and $\Delta r_{\min}$ to the size of $\Psi$. Since $0 \le \hat{r}_t \le \Delta r_{\min}$, the reward
$r_t$ in (3) stays within a bounded range, which improves convergence.
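
The shaped reward (3) can be sketched as a small function. This is only an illustration: the argument names are hypothetical, `r_cap` stands for the threshold $\Delta r_{\min}$, and its default value of 6.0 is an assumed placeholder for "the size of $\Psi$" rather than a value given in the paper.

```python
def reward(dist_to_unsafe, barrier_value, lie_value,
           beta1=1.0, beta2=1.0, delta=0.1, r_cap=6.0):
    """Barrier-guided reward following the structure of (3).

    dist_to_unsafe : distance between the current state and the unsafe region
    barrier_value  : B(x_t)
    lie_value      : Lie derivative of B at x_t under the current control
    r_cap          : cap on the penalty (assumed value, see lead-in)
    """
    base = beta1 * dist_to_unsafe
    near_level_set = abs(barrier_value) <= delta
    if near_level_set and lie_value <= 0.0:
        # Penalize a violated Lie-derivative condition, but cap the penalty.
        return base - min(beta2 * abs(lie_value), r_cap)
    return base
```

Away from the zero level set the reward is the plain distance term; near it, a non-positive Lie derivative subtracts a capped penalty, so a single bad state cannot dominate the return.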


Algorithm 1. Barrier Certificate Guided Reinforcement Learning

Input: CCDS $\mathcal{C}$; unsafe region $X_u$; barrier certificate $B(\mathbf{x})$
Output: DNN controller $k$

 1: Initialize critic $Q$ and actor $k$, and the corresponding target networks $Q' = Q$ and $k' = k$
 2: Initialize barrier certificate $B(\mathbf{x}) = \bot$ and replay buffer $R = \emptyset$
 3: Sample initial states from $\Theta$ and store them in $\Omega_\Theta$
 4: for $\mathbf{x}_0 \in \Omega_\Theta$ do
 5:     for $t = 1, \ldots, T$ do
 6:         calculate $\mathbf{u}_t = k(\mathbf{x}_t)$
 7:         calculate $\mathbf{x}_{t+1} = \mathbf{x}_t + \mathbf{f}(\mathbf{x}_t, \mathbf{u}_t)$
 8:         calculate $r_t = r(\mathbf{x}_{t+1}, X_u, B(\mathbf{x}))$
 9:         store $(\mathbf{x}_t, \mathbf{x}_{t+1}, \mathbf{u}_t, r_t)$ in $R$
10:         Sample a random minibatch of transitions from $R$
11:         Update critic $Q$ and actor $k$
12:     end for
13:     Update the target networks $Q'$ and $k'$
14: end for
15: return $k$


To synthesize the safe controller using reinforcement learning, a dataset of
sampled trajectories is needed. To sample trajectories, we first generate a set of
initial states from $\Theta$. Let $\mathbf{l}, \mathbf{u} \in \mathbb{R}^n$ be the vectors of the lower and upper bounds
of $\Theta$, i.e., $\Theta \subseteq [\mathbf{l}, \mathbf{u}]$. We first sample from each dimension of $[\mathbf{l}, \mathbf{u}]$ equidistantly
with a fixed mesh size. For a sampled initial state $\mathbf{x}_0$, its trajectory is generated,
and the transition tuples $(\mathbf{x}_t, \mathbf{x}_{t+1}, \mathbf{u}_t, r_t)$ are collected to form a replay buffer used to
update the actor and critic networks. Concretely, the actor network receives
a state $\mathbf{x}_t$ at time step $t$ as input, and directly outputs a continuous action
$\mathbf{u}_t = k(\mathbf{x}_t \mid \theta^k)$. The critic network takes the state $\mathbf{x}_t$ and the action $\mathbf{u}_t$ as input,
and outputs a scalar Q-value $Q(\mathbf{x}_t, \mathbf{u}_t \mid \theta^Q)$. Every $m$ simulated time steps, we
sample a batch of tuples from the buffer as the training data to update the actor
and critic networks, until a prescribed termination condition is met for
the learning process. The resulting actor network is the synthesized controller.
All training-related parameters, such as the smoothing factor, are left at their defaults.
Our DDPG implementation is based on an open-source package DDPG [23].
The algorithm is outlined in Algorithm 1.
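
The equidistant sampling of initial states described above can be sketched as follows. The function names and the membership predicate `in_theta` are hypothetical helpers introduced for this illustration; the paper does not specify how grid points outside $\Theta$ are handled, so filtering them out is an assumption.

```python
import itertools


def grid_points(lower, upper, mesh):
    """Equidistant grid over the box [lower, upper] with a fixed mesh size."""
    axes = []
    for lo, hi in zip(lower, upper):
        count = int(round((hi - lo) / mesh))
        axes.append([lo + k * mesh for k in range(count + 1)])
    return [list(p) for p in itertools.product(*axes)]


def sample_initial_states(lower, upper, mesh, in_theta):
    """Keep only the grid points that actually lie in the initial set."""
    return [p for p in grid_points(lower, upper, mesh) if in_theta(p)]


# Toy usage: quarter of the unit disk enclosed in the box [0, 1]^2.
box_grid = grid_points([0.0, 0.0], [1.0, 1.0], 0.5)
disk_states = sample_initial_states(
    [0.0, 0.0], [1.0, 1.0], 0.5, lambda p: p[0] ** 2 + p[1] ** 2 <= 1.0)
```

Each retained point would then serve as an initial state $\mathbf{x}_0$ for one rollout in Algorithm 1.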


Remark 2. The barrier certificate is initialized to be $\bot$, which means that the
learner initially trains a DNN controller via standard reinforcement learning,
without the aid of barrier certificates.


3.2 Safety Verification with Barrier Certificates 


In the following, we focus on the verifier component of the proposed safe DNN 
synthesis framework, as described in Fig. 2, and show how to verify the safety of 
the closed-loop system under the DNN controller yielded from the learner. 


[Figure 2: verifier blocks Polynomial Inclusion; MSIR and BC Computation (MSIR: $\Theta_\gamma$, BC: $B(\mathbf{x})$); BC Refinement (updated $\tilde{B}(\mathbf{x})$); Output: Success.]


Fig. 2. The framework of the verifier. 


As shown in Fig. 2, the learner produces a DNN controller $k_i(\mathbf{x})$. To
make barrier certificate generation amenable to polynomial
optimization, the verifier first employs Bernstein polynomial approximation
to abstract the learned DNN controller as a polynomial one $\tilde{k}_i(\mathbf{x})$, with the
associated abstraction error $\epsilon$ modeled as a bounded parameter, that is, $u = \tilde{k}_i(\mathbf{x}) + \epsilon$.

In this way, the safety of the closed-loop system under the DNN controller
can be guaranteed via the existence of barrier certificates for the closed-loop
system under the abstract controller. The verifier then solves bilinear matrix
inequalities (BMI) to obtain a maximal safe initial region
(MSIR) $\Theta_i$ and the corresponding barrier certificate $B_i(\mathbf{x})$. If the computed
MSIR $\Theta_i$ contains the given initial set $\Theta$, then the safety of the closed-loop
system under the DNN controller $u = k_i(\mathbf{x})$ is verified. Otherwise, the verifier
slightly adjusts the barrier certificate $B_i(\mathbf{x})$ by quadratic programming
to obtain an updated one $\tilde{B}_i(\mathbf{x})$, which separates the unsafe region
from the initial set. The refined BC is then fed back to guide the learner in retraining
a new DNN controller.


Polynomial Abstraction of DNN Controllers. In the following, we consider
a DNN controller with a single output; for multiple-output cases,
each output can be approximated separately. Formally, for a DNN
controller $k(\mathbf{x})$, we seek to compute an approximate polynomial $p(\mathbf{x}) \in \mathbb{R}[\mathbf{x}]$
with a verified bound $\mu \in \mathbb{R}_+$, such that

$$|k(\mathbf{x}) - p(\mathbf{x})| \le \mu, \quad \forall \mathbf{x} \in \Psi,$$

and the bound $\mu$ is as small as possible.

The Weierstrass approximation theorem [7] asserts that a continuous function on
a closed and bounded interval can be uniformly approximated on the interval
by polynomials to any degree of accuracy. In this paper, we compute the
approximate polynomial based on the theory of Bernstein polynomials [9]. Let
$\mathbf{d} = (d_1, \ldots, d_n) \in \mathbb{N}^n$ and $f : [0,1]^n \to \mathbb{R}$. The polynomial

$$B_{f,\mathbf{d}}(\mathbf{x}) = \sum_{\substack{0 \le a_j \le d_j \\ j = 1, \ldots, n}} f\left(\frac{a_1}{d_1}, \ldots, \frac{a_n}{d_n}\right) \prod_{j=1}^{n} \binom{d_j}{a_j} x_j^{a_j} (1 - x_j)^{d_j - a_j}$$

is called the multivariate Bernstein polynomial of $f$. Theoretically, the Bernstein
polynomial $B_{f,\mathbf{d}}(\mathbf{x})$ converges uniformly to $f$ as $d_1, \ldots, d_n \to \infty$. In practice,
an estimate of the approximation error bound is needed. As stated in [9],
if $f$ is a Lipschitz continuous function over $I = [0,1]^n$ with Lipschitz
constant $L$, then

$$|B_{f,\mathbf{d}}(\mathbf{x}) - f(\mathbf{x})| \le \frac{L}{2} \left( \sum_{j=1}^{n} \frac{1}{d_j} \right)^{1/2}.$$

Now, for the DNN controller $k(\mathbf{x})$ over a domain $\Psi$, we can apply the above
method to obtain a Bernstein polynomial with a valid approximation error bound
as its abstraction. Concretely, we first construct an interval enclosure for $\Psi$,
apply a linear transformation to map the interval enclosure onto the unit box $I$,
and then use Bernstein polynomial approximation to obtain an abstract polynomial
controller $\tilde{k}(\mathbf{x}) + \epsilon$ with $\epsilon \in [-\mu, \mu]$, where $\tilde{k}(\mathbf{x})$ is a Bernstein polynomial of
$k(\mathbf{x})$ and $\mu$ is its valid approximation error bound. Note that fully-connected
neural networks with sigmoid and tanh activation functions are Lipschitz
continuous, and the estimation of Lipschitz constants for deep neural networks has
been studied in [14,31,34].
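
The Bernstein abstraction and its error bound can be illustrated on a small stand-in function. In the sketch below, the target `toy_controller` is a hypothetical two-input tanh function rather than a trained DNN, its Lipschitz constant $\sqrt{2}$ (with respect to the Euclidean norm) is worked out by hand, and the degrees $(8, 8)$ are an arbitrary choice.

```python
import itertools
import math


def bernstein(f, degrees, x):
    """Evaluate the multivariate Bernstein polynomial of f at x in [0, 1]^n."""
    total = 0.0
    for a in itertools.product(*(range(d + 1) for d in degrees)):
        node = [aj / dj for aj, dj in zip(a, degrees)]
        weight = 1.0
        for aj, dj, xj in zip(a, degrees, x):
            weight *= math.comb(dj, aj) * xj ** aj * (1.0 - xj) ** (dj - aj)
        total += f(node) * weight
    return total


def error_bound(lipschitz, degrees):
    """Bound (L / 2) * sqrt(sum_j 1 / d_j) for a Lipschitz function."""
    return 0.5 * lipschitz * math.sqrt(sum(1.0 / d for d in degrees))


def toy_controller(p):
    """Hypothetical stand-in for a DNN controller on the unit box."""
    return math.tanh(p[0] - p[1])


degrees = (8, 8)
bound = error_bound(math.sqrt(2.0), degrees)
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
max_err = max(abs(bernstein(toy_controller, degrees, p) - toy_controller(p))
              for p in grid)
```

On the sampled grid the observed error `max_err` stays below `bound`, as the estimate predicts; for a real controller, the Lipschitz constant would have to come from an estimation method such as those cited above.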
Maximal Safe Initial Region Computation. Since $\tilde{k}(\mathbf{x}) + \epsilon$ encloses $k(\mathbf{x})$,
the safety of the closed-loop system under the DNN controller $k(\mathbf{x})$ can be
guaranteed via the existence of barrier certificates for the closed-loop system under
the abstract controller $\tilde{k}(\mathbf{x}) + \epsilon$. From this observation, we try to compute an
MSIR $\Theta_\gamma$ and its corresponding barrier certificate $B(\mathbf{x})$, which guarantee
the safety of the closed-loop system under the abstract controller $\tilde{k}(\mathbf{x}) + \epsilon$ with
respect to $\Theta_\gamma$.

First, we consider how to predefine a suitable initial state set template $\Theta_\gamma$
from the given initial set $\Theta$. In what follows, we provide parametric initial
state sets for two typical representations: boxes and Euclidean ellipsoids (balls).


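Box Template. Suppose that the box initial set $\Theta$ is represented as

$$\Theta = \{\mathbf{x} \in \mathbb{R}^n \mid |x_i - c_i| \le b_i,\ i = 1, \ldots, n\},$$

where $\mathbf{x}_c = (c_1, \ldots, c_n)^T$ is the center of the box, and $b_i \in \mathbb{R}_{>0}$. Then, the
parametric initial set can be expressed as

$$\Theta_\gamma = \{\mathbf{x} \in \mathbb{R}^n \mid \|D^{-1}(\mathbf{x} - \mathbf{x}_c)\|_\infty \le \gamma\},$$

where $D = \mathrm{diag}(b_1, \ldots, b_n)$ is a diagonal matrix.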


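Ellipsoid Template. Suppose that the ellipsoid initial set $\Theta$ is expressed in the
common representation

$$\Theta = \{\mathbf{x} \in \mathbb{R}^n \mid \mathbf{x} = \mathbf{x}_c + A\mathbf{v},\ \|\mathbf{v}\|_2 \le 1\},$$

where $\mathbf{x}_c$ is the center of the ellipsoid, and the matrix $A$ is nonsingular. Then
the parametric initial set can be expressed as

$$\begin{aligned}
\Theta_\gamma &= \{\mathbf{x} \in \mathbb{R}^n \mid \mathbf{x} = \mathbf{x}_c + \gamma A\mathbf{v},\ \|\mathbf{v}\|_2 \le 1\} \\
&= \{\mathbf{x} \in \mathbb{R}^n \mid \|A^{-1}(\mathbf{x} - \mathbf{x}_c)\|_2 \le \gamma\}.
\end{aligned}$$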


Without loss of generality, we can select the template of the parametric initial
sets in the form $\Theta_\gamma := \{\mathbf{x} \in \mathbb{R}^n \mid g_i(\mathbf{x}) \le \gamma,\ i = 1, \ldots, m_1\}$ with $\gamma \in \mathbb{R}_{>0}$,
where the $g_i(\mathbf{x})$ are the polynomials used to define the prescribed initial set $\Theta$.
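
For the box and ball templates, membership in $\Theta_\gamma$ reduces to comparing a scaled norm with $\gamma$. The sketch below illustrates this; the function names are hypothetical, and the ball case assumes $A = \rho I$ for a radius $\rho$, a special case of the ellipsoid template that avoids a matrix inverse.

```python
import math


def box_gauge(x, center, half_widths):
    """||D^{-1}(x - x_c)||_inf: the smallest gamma with x in the box template."""
    return max(abs(xi - ci) / bi for xi, ci, bi in zip(x, center, half_widths))


def ball_gauge(x, center, radius):
    """||A^{-1}(x - x_c)||_2 for the special case A = radius * I."""
    return math.dist(x, center) / radius


def in_template(gauge_value, gamma):
    """x lies in the parametric set exactly when its gauge is at most gamma."""
    return gauge_value <= gamma
```

With this normalization the prescribed set corresponds to $\gamma = 1$, so enlarging $\gamma$ beyond 1 grows the template around the original initial set.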

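In order to enlarge the safe initial region by the choice of $\Theta_\gamma$, we maximize $\gamma$
while imposing the constraints for the existence of barrier certificates. Assume
that the barrier certificate $B(\mathbf{x})$ is a polynomial of degree at most $d$, whose
coefficients form a vector space of dimension $s(d) = \binom{n+d}{n}$ with the canonical
basis $(\mathbf{x}^\alpha)$ of monomials. Suppose the coefficients are unknown, denote by
$\mathbf{b} = (b_\alpha) \in \mathbb{R}^{s(d)}$ the coefficient vector of $B(\mathbf{x})$, and write

$$B(\mathbf{x}, \mathbf{b}) = \sum_{\alpha \in \mathbb{N}^n_d} b_\alpha \mathbf{x}^\alpha = \sum_{\alpha \in \mathbb{N}^n_d} b_\alpha x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_n^{\alpha_n}$$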

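in the canonical basis. Thus, the problem of computing an MSIR $\Theta_\gamma$ of the
closed-loop system under the abstract controller $\tilde{k}(\mathbf{x}) + \epsilon$ can be represented as
the optimization problem

$$\begin{aligned}
\gamma_{opt} = \max_{\mathbf{b}, \gamma}\ & \gamma \\
\text{s.t. } & B(\mathbf{x}, \mathbf{b}) \ge 0, \quad \forall \mathbf{x} \in \Theta_\gamma, \\
& L_f B(\mathbf{x}, \mathbf{b}) > 0, \quad \forall \mathbf{x} \in \Psi \text{ and } B(\mathbf{x}, \mathbf{b}) = 0, \\
& B(\mathbf{x}, \mathbf{b}) < 0, \quad \forall \mathbf{x} \in X_u.
\end{aligned} \tag{4}$$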
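
The parameterization $B(\mathbf{x}, \mathbf{b})$ used in (4) can be sketched by enumerating the exponent set $\mathbb{N}^n_d$ and evaluating the polynomial in the monomial basis. The helper names and the ordering of the exponents are choices made for this illustration only.

```python
import itertools
import math


def exponents(n, d):
    """All exponent vectors alpha in N^n with alpha_1 + ... + alpha_n <= d."""
    return [a for a in itertools.product(range(d + 1), repeat=n) if sum(a) <= d]


def eval_poly(b, alphas, x):
    """B(x, b) = sum_alpha b_alpha * x^alpha in the monomial basis."""
    return sum(c * math.prod(xi ** ai for xi, ai in zip(x, a))
               for c, a in zip(b, alphas))


alphas = exponents(2, 2)   # (0,0), (0,1), (0,2), (1,0), (1,1), (2,0)
coeffs = [1.0, 0.0, 0.0, 0.0, 0.0, -1.0]   # encodes 1 - x1^2 in this ordering
value = eval_poly(coeffs, alphas, (2.0, 5.0))
```

The number of exponent vectors matches $s(d) = \binom{n+d}{n}$, and because $B(\mathbf{x}, \mathbf{b})$ is linear in $\mathbf{b}$ for fixed $\mathbf{x}$, pointwise constraints on $B$ become linear constraints on the coefficient vector.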


Then, the Sum-of-Squares (SOS) relaxation technique is applied to encode the
optimization problem (4) as an SOS program. In fact, given a basic semi-algebraic
set $K$ defined by

$$K = \{\mathbf{x} \in \mathbb{R}^n \mid p_1(\mathbf{x}) \ge 0, \ldots, p_s(\mathbf{x}) \ge 0\},$$

where $p_j \in \mathbb{R}[\mathbf{x}]$, $1 \le j \le s$, a sufficient condition for the nonnegativity of a
given polynomial $f(\mathbf{x})$ on the semi-algebraic set $K$ is

$$f(\mathbf{x}) = \sigma_0(\mathbf{x}) + \sum_{i=1}^{s} \sigma_i(\mathbf{x}) p_i(\mathbf{x}), \tag{5}$$

where $\sigma_i \in \Sigma[\mathbf{x}]_d$, $0 \le i \le s$. Indeed, every term on the right-hand side of (5) is
nonnegative on $K$, so the representation (5) ensures that $f(\mathbf{x})$ is nonnegative on $K$.

In (4), the polynomial $L_f B(\mathbf{x}, \mathbf{b})$ involves the uncertain variable
$\epsilon$ in the range $[-\mu, \mu]$, which can be written as the constraint $h(\epsilon) \ge 0$
with

$$h(\epsilon) := (\epsilon + \mu)(\mu - \epsilon).$$

Thus, the problem (4) can be transformed into the following optimization
problem


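$$\begin{aligned}
\gamma^* = \max_{\mathbf{b}, \gamma}\ & \gamma \\
\text{s.t. } & B(\mathbf{x}, \mathbf{b}) - \sigma(\mathbf{x})(\gamma - g(\mathbf{x})) \in \Sigma[\mathbf{x}], \\
& L_f B(\mathbf{x}, \mathbf{b}) - \lambda(\mathbf{x}) B(\mathbf{x}, \mathbf{b}) - \sum_j \phi_j(\mathbf{x}) h_j(\mathbf{x}) - \nu(\mathbf{x}, \epsilon) h(\epsilon) - \epsilon_1 \in \Sigma[\mathbf{x}, \epsilon], \\
& -B(\mathbf{x}, \mathbf{b}) - \epsilon_2 - \sum_j \kappa_j(\mathbf{x}) q_j(\mathbf{x}) \in \Sigma[\mathbf{x}],
\end{aligned} \tag{6}$$

where $\epsilon_1, \epsilon_2 > 0$, the entries of $\sigma(\mathbf{x})$, $\phi_j(\mathbf{x})$, $\kappa_j(\mathbf{x}) \in \Sigma[\mathbf{x}]$, $\nu(\mathbf{x}, \epsilon) \in \Sigma[\mathbf{x}, \epsilon]$,
and $\lambda(\mathbf{x}) \in \mathbb{R}[\mathbf{x}]$. Note that $\epsilon_1, \epsilon_2$ are needed to ensure the strict positivity of polynomials
required in the second and third constraints in (4). Clearly, the feasibility of
the constraints in (6) is sufficient to imply the feasibility of the constraints in (4),
thus the optimum of (6) is a lower bound of the optimum of (4), i.e., $\gamma^* \le \gamma_{opt}$.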

The SOS program (6) is bilinear due to the products of the unknown
coefficients of $(B(\mathbf{x}, \mathbf{b}), \lambda(\mathbf{x}))$ and $(\sigma(\mathbf{x}), \gamma)$, yielding a non-convex bilinear matrix
inequalities (BMI) problem. Fortunately, the Matlab package PENBMI [22],
which combines the (exterior) penalty and (interior) barrier methods with the
augmented Lagrangian method, can be applied directly to obtain a numerical
solution of the problem (6). The solution $\gamma^*, \mathbf{b}^*$ to problem (6) yields an MSIR
$\Theta_{\gamma^*}$ and its corresponding barrier certificate $B(\mathbf{x}, \mathbf{b}^*)$. This means that the
closed-loop system under the abstract controller $\tilde{k}(\mathbf{x}) + \epsilon$ is safe with respect to $\Theta_{\gamma^*}$.
Moreover, if the given initial set $\Theta$ is a subset of $\Theta_{\gamma^*}$, then the safety of the
closed-loop system under the DNN controller $k(\mathbf{x})$ with respect to $\Theta$ is verified.
Otherwise, $B(\mathbf{x}, \mathbf{b}^*)$ is further refined via quadratic programming.
Remark 3. The gap between the optima of problems (4) and (6) decreases as
the degrees of the multiplier polynomials increase. The degree bound for the
multiplier polynomials is exponential in the number of variables $\mathbf{x}$ and the
degrees of the polynomials appearing in the semi-algebraic sets. In practice, we
set up a truncated SOS program for (6) by fixing a priori a (much smaller)
degree bound for all the unknown multiplier polynomials, to avoid high
computational complexity.


Barrier Certificate Refinement. Consider the case in which the initial set $\Theta$
is not a subset of the MSIR $\Theta_{\gamma^*}$. In this case, the barrier certificate $B(\mathbf{x}, \mathbf{b}^*)$
separates the unsafe region $X_u$ from $\Theta_{\gamma^*}$, but it may fail to separate it
from $\Theta$. In other words, $B(\mathbf{x}, \mathbf{b}^*)$ cannot be regarded as a true candidate barrier
certificate with respect to $\Theta$ and $X_u$. Therefore, we use the information
in $B(\mathbf{x}, \mathbf{b}^*)$ to refine it, in order to obtain a new candidate barrier certificate that
separates $\Theta$ from $X_u$. Since the change in $B(\mathbf{x}, \mathbf{b}^*)$ should be as small
as possible, the barrier certificate refinement step can be represented as

$$\begin{aligned}
\min_{\mathbf{b}}\ & \|\mathbf{b} - \mathbf{b}^*\|_2 \\
\text{s.t. } & B(\mathbf{x}, \mathbf{b}) \ge 0 \quad \forall \mathbf{x} \in \Theta, \\
& B(\mathbf{x}, \mathbf{b}) < 0 \quad \forall \mathbf{x} \in X_u.
\end{aligned} \tag{7}$$


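The constraints in (7) involve universal quantifiers.
To avoid eliminating the universal quantifiers directly, we provide a relaxation
technique for (7) based on selecting sampling points. For
$\Theta$ and $X_u$, let us first construct rectangular meshes in $\Theta$ and $X_u$, respectively,
with a mesh spacing $r \in \mathbb{R}_+$ (say $r = 0.05$). The resulting mesh point sets are
denoted as $\Omega_\Theta$ and $\Omega_{X_u}$, respectively.

It is known that for a continuously differentiable function $\varphi(\mathbf{x})$ over a compact
domain $D$, the mean value theorem yields

$$|\varphi(\mathbf{x} + \Delta\mathbf{x}) - \varphi(\mathbf{x})| \le n \eta \|\Delta\mathbf{x}\|_\infty,$$

where $\mathbf{x}, \mathbf{x} + \Delta\mathbf{x} \in D$ are arbitrary, and $\eta = \sup_{\mathbf{x} \in D} \|\nabla\varphi(\mathbf{x})\|_\infty$. Based on
the above observation, the following implications hold:

$$\begin{aligned}
B(\mathbf{x}_i, \mathbf{b}) - \delta_1 \ge 0,\ \forall \mathbf{x}_i \in \Omega_\Theta &\implies B(\mathbf{x}, \mathbf{b}) \ge 0 \quad \forall \mathbf{x} \in \Theta, \\
B(\mathbf{x}_i, \mathbf{b}) + \delta_2 < 0,\ \forall \mathbf{x}_i \in \Omega_{X_u} &\implies B(\mathbf{x}, \mathbf{b}) < 0 \quad \forall \mathbf{x} \in X_u,
\end{aligned}$$

where $\delta_i = n \eta_i r \in \mathbb{R}_{>0}$, $i = 1, 2$, with $\eta_1 = \sup_{\mathbf{x} \in \Theta} \|\nabla B(\mathbf{x}, \mathbf{b}^*)\|_\infty$ and $\eta_2 =
\sup_{\mathbf{x} \in X_u} \|\nabla B(\mathbf{x}, \mathbf{b}^*)\|_\infty$.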

By using the above relaxation technique based on sampling points, (7) can
be relaxed to the following problem

$$\begin{aligned}
\min_{\mathbf{b}}\ & \|\mathbf{b} - \mathbf{b}^*\|_2 \\
\text{s.t. } & B(\mathbf{x}_i, \mathbf{b}) - \delta_1 \ge 0, \quad \forall \mathbf{x}_i \in \Omega_\Theta, \\
& B(\mathbf{x}_i, \mathbf{b}) + \delta_2 < 0, \quad \forall \mathbf{x}_i \in \Omega_{X_u},
\end{aligned} \tag{8}$$

which is a typical quadratic programming problem and can be solved by
state-of-the-art solvers with great efficiency.
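
Because $B(\mathbf{x}_i, \mathbf{b})$ is linear in $\mathbf{b}$, each mesh point contributes one linear constraint to (8). The sketch below computes the margin $\delta = n \eta r$ and checks a candidate coefficient vector against the sampled constraints; it does not solve the quadratic program itself. The degree-2 basis in two variables, the mesh points, and the value $\eta = 1$ are assumptions made for this illustration.

```python
def basis(x):
    """Degree-2 monomial basis in two variables (an illustrative choice)."""
    x1, x2 = x
    return [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]


def margin(n, eta, r):
    """Safety margin delta = n * eta * r from the mean value estimate."""
    return n * eta * r


def satisfies_sampled_constraints(b, pts_theta, pts_unsafe, delta1, delta2):
    """Check the linear constraints of (8) for a candidate coefficient vector."""
    value = lambda p: sum(c * m for c, m in zip(b, basis(p)))
    ok_theta = all(value(p) - delta1 >= 0.0 for p in pts_theta)
    ok_unsafe = all(value(p) + delta2 < 0.0 for p in pts_unsafe)
    return ok_theta and ok_unsafe


# Hypothetical mesh points with spacing r = 0.05 on two small patches.
pts_theta = [(1.0 + 0.05 * i, 0.05 * j) for i in range(5) for j in range(5)]
pts_unsafe = [(-1.0 - 0.05 * i, 0.05 * j) for i in range(5) for j in range(5)]
delta = margin(n=2, eta=1.0, r=0.05)

good = satisfies_sampled_constraints([0, 1.0, 0, 0, 0, 0],
                                     pts_theta, pts_unsafe, delta, delta)
weak = satisfies_sampled_constraints([0, 0.05, 0, 0, 0, 0],
                                     pts_theta, pts_unsafe, delta, delta)
```

The candidate $B = x_1$ clears both margins, while $B = 0.05\,x_1$ has the right sign at every sample but fails the margin, which is exactly the situation the $\delta$ terms guard against. In practice the rows `basis(p)` and the margins would be handed to a quadratic programming solver.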

Now, the refined $\tilde{B}(\mathbf{x}) = B(\mathbf{x}, \tilde{\mathbf{b}})$ separates $\Theta$ from $X_u$, but may still not
satisfy the Lie derivative condition for barrier certificates. According to
Theorem 1, $\tilde{B}(\mathbf{x})$ is then not a true barrier certificate for the closed-loop system under
the abstract controller $\tilde{k}(\mathbf{x}) + \epsilon$ with respect to $\Theta$ and $X_u$. Next, the refined
$\tilde{B}(\mathbf{x})$ is fed back to guide the learner in retraining a new controller. To do
so, we add the constraint on the Lie derivative of $\tilde{B}(\mathbf{x})$, and
apply barrier certificate guided reinforcement learning to compute a new DNN
controller.


4 Algorithm 


In Sect. 3, we have elaborated on the iteration-based safe controller synthesis 
method that iteratively co-synthesizes a DNN controller within the RL frame- 
work and a polynomial barrier certificate via BMI solving. Briefly, we describe 
the main implementation steps of our approach in the following Algorithm 2. 


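Algorithm 2. SRLBC: Safe Reinforcement Learning with Barrier Certificate
Input: The CCDS $\mathcal{C}$; unsafe region $X_u$; maximum number of iterations maxIter
Output: Safe DNN controller $k$
 1: iter ← 0
 2: $B$ ← $\bot$
 3: while iter < maxIter do
 4:     $k$ ← Learning($\mathbf{f}$, $\Theta$, $X_u$, $B$)
 5:     $\tilde{k}$, $\mu$ ← PolyInclusion($k$)
 6:     $\Theta_{\gamma^*}$, $B(\mathbf{x}, \mathbf{b}^*)$ ← MaxSafeSet($\mathbf{f}$, $\tilde{k}$, $\mu$, $\Theta$, $X_u$)
 7:     if $\Theta \subseteq \Theta_{\gamma^*}$ then
 8:         return $k$
 9:     end if
10:     $B$ ← RefineBarrier($B(\mathbf{x}, \mathbf{b}^*)$, $\Theta$, $X_u$); iter ← iter + 1
11: end while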


Algorithm 2 shows the iteration scheme of our safe controller synthesis, which
guides the experimental implementation. The procedure takes as inputs a CCDS
$\mathcal{C}$, an unsafe region $X_u$, and a maximum number of iterations maxIter, and returns
a safe DNN controller of a given architecture. Each pass of the iteration
has four steps, as follows.


(i) Apply the RL method to train a DNN controller. The learner introduced in
Sect. 3.1 is implemented by Line 4 in Algorithm 2, and the barrier certificate
is initialized to be $\bot$, which means that the learner trains a DNN controller
via classical reinforcement learning, without the aid of barrier certificates,
in the initial pass;

(ii) For the closed-loop system under the DNN controller learned in Step (i),
compute a maximal safe initial region (MSIR) for which a barrier
certificate exists. We use Bernstein polynomial approximation to compute a
polynomial abstraction of the learned DNN controller by Line 5, and then
compute an MSIR $\Theta_{\gamma^*}$ and the corresponding barrier certificate $B(\mathbf{x}, \mathbf{b}^*)$
by Line 6;

(iii) Check whether the MSIR $\Theta_{\gamma^*}$ from Step (ii) contains the given
initial set $\Theta$. If $\Theta \subseteq \Theta_{\gamma^*}$, then we terminate the loop with a verified safe
DNN controller; otherwise go to Step (iv). This step refers to Lines 7-9;

(iv) Slightly modify the barrier certificate from Step (ii) so that it separates
the initial set and the unsafe region, and then go to Step (i) to learn a
new controller by encoding the refined barrier certificate into the reward
function. For this task, the barrier certificate $B$ is refined via quadratic
programming by Line 10.


This inductive loop repeats until an MSIR enclosing $\Theta$ and its corresponding
barrier certificate are computed or until a timeout is reached.
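
The control flow of this loop can be sketched with the four components passed in as callables. Everything below is a skeleton under stated assumptions: the callables are hypothetical placeholders for the learner, the Bernstein abstraction, the BMI solver, and the refinement step, and the toy stand-ins only exercise the loop logic.

```python
def srlbc(learn, poly_inclusion, max_safe_set, contains_theta, refine, max_iter):
    """Skeleton of the learner/verifier loop; returns None if no pass succeeds."""
    barrier = None                                     # no barrier in the first pass
    for _ in range(max_iter):
        controller = learn(barrier)                    # Step (i)
        poly, mu = poly_inclusion(controller)          # Step (ii), abstraction
        region, certificate = max_safe_set(poly, mu)   # Step (ii), MSIR and BC
        if contains_theta(region):                     # Step (iii)
            return controller, certificate
        barrier = refine(certificate)                  # Step (iv)
    return None


# Toy stand-ins (assumptions): the second controller yields a larger region.
learn = lambda barrier: "k0" if barrier is None else "k1"
poly_inclusion = lambda k: (k + "_poly", 0.05)
max_safe_set = lambda poly, mu: (0.8 if poly == "k0_poly" else 1.3, "B_for_" + poly)
contains_theta = lambda gamma: gamma >= 1.0
refine = lambda certificate: certificate + "_refined"

result = srlbc(learn, poly_inclusion, max_safe_set, contains_theta, refine, max_iter=3)
single_pass = srlbc(learn, poly_inclusion, max_safe_set, contains_theta, refine, max_iter=1)
```

With these stand-ins the first pass fails the inclusion check, the refined barrier changes what the learner returns, and the second pass succeeds; with a budget of one pass the loop gives up.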


Remark 4. Our procedure is sound, i.e., a valid output from the verifier is
provably correct. However, we cannot claim completeness: the procedure
might not terminate in general, because the existence of a barrier certificate
is only a sufficient condition for the safety of the system, and such a barrier
certificate may not exist. If the procedure fails, we may improve the
relaxation precision, and thereby the chance of finding a barrier certificate,
by increasing the degree bound for the multiplier polynomials in the SOS
program (6).


Furthermore, an example is used to depict how our safe controller synthesis 
algorithm works. 


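Example 1. Consider the Van der Pol system

$$\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix} = \begin{bmatrix} x_2 \\ -x_1 + \frac{1}{3} x_1^3 - x_2 + u \end{bmatrix}$$

with the domain $\Psi = \{\mathbf{x} \in \mathbb{R}^2 \mid -3 \le x_1, x_2 \le 3\}$. Our goal is to design a control
law $k$ such that all trajectories of the system under $u = k(x_1, x_2)$ starting from
the initial set

$$\Theta = \{\mathbf{x} \in \mathbb{R}^2 \mid (x_1 - 1.5)^2 + x_2^2 \le 1.17\}$$

will never enter the unsafe set

$$X_u = \{\mathbf{x} \in \mathbb{R}^2 \mid (x_1 + 1)^2 + (x_2 + 1)^2 \le 1\}.$$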


We reach this goal with Algorithm 2 and provide the details here. First,
we apply reinforcement learning to train the initial neural network
controller $u = k_0(\mathbf{x})$ with safety as the training objective, which is Step
(i) and refers to Line 4 in Algorithm 2, and then try to generate the barrier
certificate $B(\mathbf{x})$. In Step (ii), we compute a polynomial abstraction of the DNN controller $k_0(\mathbf{x})$ via
Bernstein polynomials (Line 5), obtaining

$$\begin{aligned}
k_0(\mathbf{x}) ={}& 0.0142 x_1 + 0.0092 x_2 - 0.0205 x_1^2 + 0.0077 x_1 x_2 + 0.0340 x_2^2 \\
&+ 0.0246 x_1^3 + 0.0018 x_1^2 x_2 - 0.0820 x_1 x_2^2 + 0.0435 x_2^3 + \epsilon,
\end{aligned}$$

with $\epsilon \in [-0.05, 0.05]$. The polynomial
abstraction thus yields an abstract polynomial system.

We continue with Step (ii) to compute a maximal safe initial region $\Theta_\gamma$ and the corresponding
barrier certificate $B(\mathbf{x})$. In this case, we parameterize the initial set as

$$\Theta_\gamma = \{\mathbf{x} \in \mathbb{R}^2 \mid (x_1 - 1.5)^2 + x_2^2 \le \gamma\}.$$

For the given abstract polynomial system with the parameterized initial set
$\Theta_\gamma$, our goal is to maximize $\gamma$ subject to the existence of a barrier
certificate. By calling the PENBMI solver [22], we obtain a barrier certificate
$B_0(\mathbf{x})$ with the maximal safe initial region $\Theta_0$ (Line 6 in Algorithm 2), i.e.,

$$\begin{aligned}
\Theta_0 &= \{\mathbf{x} \in \mathbb{R}^2 \mid (x_1 - 1.5)^2 + x_2^2 \le 0.8132\}, \\
B_0(\mathbf{x}) &= 11.716 + 22.8064 x_1 + 21.5368 x_2 - 4.5273 x_1^2 + 13.8084 x_1 x_2 + 3.0453 x_2^2.
\end{aligned} \tag{10}$$


Thus, the safety of the system under the controller $k_0(\mathbf{x})$ with respect to the set
$\Theta_0$ is guaranteed. However, since $\Theta \not\subseteq \Theta_0$, the controller $k_0(\mathbf{x})$ is not verified to be safe
on the whole initial set $\Theta$, so we continue to update the controller and the barrier certificate (Lines 7-9).

Taking $k_0(\mathbf{x})$ and $B_0(\mathbf{x})$ as the initial controller and the initial barrier certificate,
we run the iterative framework to synthesize a controller subject to the
safety constraint. As shown in Fig. 3(a), the zero level set of $B_0(\mathbf{x})$ is the blue
dashed line. $B_0(\mathbf{x})$ separates the unsafe region
$X_u$ (the red circle) from $\Theta_0$ (the green dashed circle), but not from the
initial set $\Theta$, which means that $B_0(\mathbf{x})$ cannot be regarded as a true barrier
certificate with respect to $\Theta$. Therefore, one may perturb the coefficients of $B_0(\mathbf{x})$ to obtain $\tilde{B}_0(\mathbf{x})$,
which separates $\Theta$ and $X_u$. This process corresponds to Step (iv) and
Line 10 of Algorithm 2. The perturbed polynomial is

$$\tilde{B}_0(\mathbf{x}) = 10.5590 + 22.9401 x_1 + 18.2448 x_2 - 0.8954 x_1^2 + 14.4971 x_1 x_2 + 1.1060 x_2^2.$$


As shown in Fig. 3(b), the zero level set of $\tilde{B}_0(\mathbf{x})$ (the blue curve)
separates $X_u$ (the red circle) from $\Theta$ (the green circle). According to the concept
of barrier certificate and Theorem 1, $\tilde{B}_0(\mathbf{x})$ is not a true barrier certificate,
since the Lie derivative condition is not satisfied.
Accordingly, using $\tilde{B}_0(\mathbf{x})$ and the initial controller $k_0(\mathbf{x})$, we
retrain a control law with an additional constraint on the Lie derivative of the
barrier certificate $\tilde{B}_0(\mathbf{x})$. Calling the learner module (Line 4), we obtain a new
control law $k_1(\mathbf{x})$, represented as a two-hidden-layer sigmoid-based DNN with 20
neurons per layer, by the RL approach.




[Figure 3: panels (a) and (b)]


Fig. 3. The iteration process of barrier certificate updating while
learning the safe controller. The red circles stand for unsafe regions, the blue curves stand
for the zero level sets of barrier certificates, and the green circles stand for the initial sets
and safe initial sets. Subfigure (a) shows the intermediate result, the maximal safe
initial set $\Theta_0$ (the green dashed circle) with its associated barrier certificate $B_0$ obtained
from Line 6 in Algorithm 2 at the first iteration. We slightly modify the barrier function
$B_0$ to separate $\Theta$ and $X_u$ by Line 10 and obtain $\tilde{B}_0$, which is the blue solid curve shown
in Subfigure (b). Using $\tilde{B}_0$ as a guide, a new controller is learned, from which a barrier
certificate $B_1$ is generated as shown in Subfigure (c). It can be shown that $B_1$ is a
true barrier certificate of the system. (Color figure online)


Repeating the above abstraction technique and solving the BMI problem to
find the maximal safe initial region $\Theta_1$, we obtain the barrier certificate $B_1(\mathbf{x})$
with respect to $\Theta_1$, i.e.,

$$\begin{aligned}
\Theta_1 &= \{\mathbf{x} \in \mathbb{R}^2 \mid (x_1 - 1.5)^2 + x_2^2 \le 1.2201\}, \\
B_1(\mathbf{x}) &= 10.3661 + 22.6569 x_1 + 17.7852 x_2 - 0.9037 x_1^2 + 14.1832 x_1 x_2 + 0.9471 x_2^2.
\end{aligned}$$

It is easy to check that the original initial set $\Theta$ is now a subset of $\Theta_1$, which
means that $B_1(\mathbf{x})$ is a true barrier certificate.
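
The inclusion claim and the sign conditions (i) and (ii) of Theorem 1 can be spot-checked numerically with the coefficients above. This is a sampled sanity check only, not a proof: the polar sampling scheme is an arbitrary choice made for this sketch, and the Lie derivative condition (iii) is not checked because it depends on the learned controller.

```python
import math


def B1(x1, x2):
    """Barrier certificate B_1 with the coefficients reported in Example 1."""
    return (10.3661 + 22.6569 * x1 + 17.7852 * x2 - 0.9037 * x1 ** 2
            + 14.1832 * x1 * x2 + 0.9471 * x2 ** 2)


def disk_samples(cx, cy, radius_sq, n_r=20, n_theta=180):
    """Polar grid of sample points covering a disk, boundary included."""
    radius = math.sqrt(radius_sq)
    return [(cx + radius * i / n_r * math.cos(2 * math.pi * k / n_theta),
             cy + radius * i / n_r * math.sin(2 * math.pi * k / n_theta))
            for i in range(n_r + 1) for k in range(n_theta)]


theta_pts = disk_samples(1.5, 0.0, 1.17)     # initial set
unsafe_pts = disk_samples(-1.0, -1.0, 1.0)   # unsafe set

theta_in_theta1 = 1.17 <= 1.2201             # same center, compare squared radii
min_on_theta = min(B1(*p) for p in theta_pts)
max_on_unsafe = max(B1(*p) for p in unsafe_pts)
```

Both sets are disks around the same center as $\Theta_1$ or around $(-1, -1)$, so the inclusion reduces to comparing squared radii, and the two extrema indicate whether $B_1$ keeps the expected sign on each sampled set.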


5 Experiments 


In this section, we first use a three-dimensional nonlinear continuous
system to illustrate our algorithm by synthesizing a safe DNN controller for
it, and then present an experimental evaluation of our algorithm over a set of
benchmark examples, comparing it with the DNN controller learning framework
nncontroller [39].


Example 2. Consider the continuous dynamical system

$$\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \end{bmatrix} = \begin{bmatrix} x_3 + 8 x_2 \\ -x_2 + x_3 \\ -x_3 - x_1^2 + u \end{bmatrix}$$

with the domain

$$\Psi = \{\mathbf{x} \in \mathbb{R}^3 \mid x_1^2 + x_2^2 + x_3^2 \le 16\}.$$

Our goal is to design a control law $k$ such that all trajectories of the closed-loop
system under $u = k(x_1, x_2, x_3)$ starting from the initial set

$$\Theta = \{\mathbf{x} \in \mathbb{R}^3 \mid x_1^2 + x_2^2 + x_3^2 \le 1\}$$

will never enter the unsafe set

$$X_u = \{\mathbf{x} \in \mathbb{R}^3 \mid (x_1 - 2.1)^2 + (x_2 - 2.1)^2 + (x_3 - 2.1)^2 \le 1.87\}.$$


It suffices to synthesize a control law $k$ and a barrier certificate $B(\mathbf{x})$ with
the maximal safe initial region $\Theta_\gamma$ such that $\Theta \subseteq \Theta_\gamma$. Suppose that the DNN
controller $k$ is represented as a five-hidden-layer sigmoid-activated DNN with
30 neurons per layer. We first call the learner to train a DNN controller, and
then call the verifier to compute the maximal safe initial region $\Theta_\gamma$ and its
corresponding barrier certificate $B(\mathbf{x})$. After two iterations, we successfully obtain
a safe DNN controller and the following barrier certificate

$$\begin{aligned}
B(\mathbf{x}) ={}& 220.1981 - 45.7322 x_1 - 40.2831 x_2 - 218.4765 x_3 + 4.9575 x_1^2 \\
&+ 38.7288 x_1 x_2 - 9.8224 x_1 x_3 - 66.8398 x_2^2 + 17.2562 x_2 x_3 + 18.3967 x_3^2.
\end{aligned} \tag{12}$$


As shown in Fig. 4, the zero level set of the barrier certificate $B(\mathbf{x})$ (the blue
surface) separates $X_u$ (the red ball) from all trajectories starting from $\Theta$ (the
green ball). Therefore, the safety of the above system is verified.


Fig. 4. Phase portrait of the system in Example 2. The zero level set of the barrier
certificate $B(\mathbf{x})$ (the blue surface) separates $X_u$ (the red ball) from all trajectories
starting from $\Theta$ (the green ball). (Color figure online)


We have implemented a safe controller synthesis tool called SRLBC based
on Algorithm 2, using TensorFlow 1.14 for DNN controller synthesis and the
Matlab package PENBMI [22] for barrier certificate generation. Table 1 shows the
performance evaluation of SRLBC and nncontroller [39] on 12 continuous
systems. All experiments were conducted on a machine running Windows 10 with
16 GB RAM, a 3.20 GHz AMD Ryzen 7 3700X CPU, and an NVIDIA GeForce
GTX 1650 SUPER GPU.

In Table 1, the origins of the 12 examples are given in the first column;
$d_f$ denotes the maximal degree of the polynomials in the vector fields; $n_x$ denotes
the number of state variables; $L$ and $N$ refer to the number of hidden layers
and the number of neurons in each hidden layer, respectively; $t_1$ and $t_2$ denote the
time spent by SRLBC and nncontroller in seconds, respectively; the symbol ‘—’ means
that nncontroller was unable to return a safe DNN controller within 10,000 s.


Table 1. Performance evaluation 


Example  | d_f | n_x | NN structure | SRLBC              | nncontroller
         |     |     |  L  |  N     | deg B(x) |  t (s)  | NN-type BC |  t (s)
---------+-----+-----+-----+--------+----------+---------+------------+---------
C1 [28]  |  2  |  2  |  4  |  20    |    2     |   54.77 | 2-10-1     |   20.52
C2 [6]   |  3  |  2  |  4  |  20    |    2     |   37.54 | 2-10-1     |    8.46
C3 [6]   |  3  |  2  |  4  |  20    |    2     |   35.99 | 2-10-1     |    6.77
C4 [27]  |  3  |  2  |  4  |  20    |    4     |   38.68 | 2-10-1     |    6.88
C5 [39]  |  3  |  3  |  5  |  30    |    2     |   56.21 | 3-10-1     |   32.19
C6 [20]  |  3  |  4  |  5  |  30    |    2     |   45.54 | 4-10-1     |   78.52
C7 [6]   |  3  |  4  |  5  |  30    |    4     |   40.82 | 4-10-1     |  184.85
C8 [32]  |  2  |  5  |  5  |  30    |    2     |  423.11 | 5-20-1     | 2217.41
C9 [38]  |  2  |  6  |  5  |  30    |    2     |  383.26 | —          | —
C10 [4]  |  3  |  6  |  5  |  30    |    4     |  942.74 | —          | —
C11 [21] |  2  |  7  |  5  |  30    |    2     | 1829.46 | —          | —
C12 [21] |  2  |  9  |  5  |  30    |    2     | 6208.79 | —          | —


Table 1 shows that SRLBC handles all 12 examples within 3 iterations, while
nncontroller succeeds on only 8 of them. In particular, for the four examples
C9 to C12, whose dimensions exceed 5, nncontroller fails to synthesize safe
controllers within the specified time bound after various attempts. We tried
different network structures, with the number of hidden layers varying from 1 to 5
and the number of hidden neurons chosen among {10, 20, 30, 40}; in every case
nncontroller failed to train candidate DNN controllers and barrier certificates
within the time limit, whereas SRLBC yields safe controllers, represented as
sigmoid-activated neural networks with five hidden layers.

We now compare the efficiency of SRLBC and nncontroller in terms of the time
spent synthesizing safe DNN controllers for the shared examples C1-C8. On average,
SRLBC takes 91.58 s to synthesize a safe DNN controller, while nncontroller
needs 323.28 s, i.e., about 3.53 times as long. Although the network structures
used by SRLBC are more complex than those used by nncontroller, with many more
neurons, SRLBC synthesizes controllers more efficiently.

SRLBC also scales better than nncontroller on the considered examples.
Although SRLBC takes more time than nncontroller for the systems of dimension
2 or 3, it is faster on the systems of dimension higher than 3 (C6-C8) and is the
only one of the two tools that solves examples C9-C12. Compared with nncontroller,
which is also a data-driven approach, SRLBC inherits the learning efficiency
of reinforcement learning, whereas the size of the training data for nncontroller
increases exponentially with the dimension of the considered system, which
greatly limits the scale of the problems it can handle. Beyond Table 1, we have
tried an example of a nonlinear polynomial system [16] with dimension up to 12,
on which SRLBC successfully yields a result in 54,314 s while nncontroller fails.
This indicates that our approach is able to handle large-scale problems.
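
The averages quoted above can be retraced with a few lines of arithmetic. The sketch below recomputes the SRLBC mean from the C1-C8 running times listed in Table 1; the nncontroller mean is taken as stated in the text rather than recomputed, and the variable names are ours.

```python
# Running times t (s) of SRLBC on the shared examples C1-C8, copied from Table 1.
srlbc_times = [54.77, 37.54, 35.99, 38.68, 56.21, 45.54, 40.82, 423.11]

# Average running time of nncontroller on the same examples, as stated in the text.
nncontroller_mean = 323.28

srlbc_mean = sum(srlbc_times) / len(srlbc_times)
ratio = nncontroller_mean / srlbc_mean

print(f"SRLBC mean: {srlbc_mean:.2f} s")
print(f"nncontroller / SRLBC: {ratio:.2f}")
```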

During the experiments, we observed that SRLBC obtains near-safe
controllers in the first iteration for most examples; the remaining work
is to refine the barrier certificates slightly and use them to guide and adjust the
controllers. In fact, the number of iterations did not exceed 3 on any of the
benchmarks. These observations show that our iterative scheme of safe
reinforcement learning converges well in practice, because the refinement of the
controllers can reuse the intermediate learned results. In addition, SRLBC can
be generalized to non-polynomial systems: it has successfully solved the classical
continuous Cartpole system [3], which will be presented in future work.
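
The learner-verifier iteration discussed above can be summarized by the following control-flow sketch. It is not the SRLBC implementation: `learn` and `verify` are hypothetical callables standing in for the reinforcement-learning step and the barrier-certificate computation, and the toy stubs at the bottom exist only to make the loop executable.

```python
from typing import Any, Callable, Optional, Tuple


def synthesize(
    learn: Callable[[Optional[Any], Optional[Any]], Any],
    verify: Callable[[Any], Tuple[Optional[Any], bool]],
    max_iterations: int = 3,
) -> Tuple[Optional[Any], Optional[Any], int]:
    """Alternate between training a controller and checking it with a verifier."""
    controller, certificate = None, None
    for iteration in range(1, max_iterations + 1):
        # Learner: (re)train the controller, guided by the latest certificate.
        controller = learn(controller, certificate)
        # Verifier: compute a certificate and report whether safety is established.
        certificate, is_safe = verify(controller)
        if is_safe:
            return controller, certificate, iteration
    return None, None, max_iterations


# Toy stand-ins (purely illustrative): a "controller" is a single gain that the
# learner increases each round; the "verifier" accepts gains of at least 2.
def toy_learn(previous, certificate):
    return 1.0 if previous is None else previous + 1.0


def toy_verify(gain):
    return f"certificate for gain {gain}", gain >= 2.0


print(synthesize(toy_learn, toy_verify))
```

With these stubs the loop stops in the second round, mirroring the observation that a near-safe controller from the first round only needs a small adjustment.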


6 Related Work 


Our work on synthesizing DNN controllers for safety control of nonlinear systems
is mainly related to two categories of research: formal verification of nonlinear
systems with DNN controllers and safe DNN controller synthesis. Considerable
research has been conducted in these areas in recent years because of their
applications in safety-critical systems.


Formal Verification of Nonlinear Systems with DNN Controller. One
of the mainstream methodologies constructs over-approximations
of the reachable sets of the system trajectories under DNN controllers. The
core technique first performs output range analysis of the neural network
components and then combines the output range with reachability analysis of the
dynamical system. For instance, based on the output range analysis in [13],
Dutta et al. verified feedback control systems with DNN controllers using
mixed-integer linear programming [12]. They implemented a prototype
for neural rule generation inside the tool Sherlock and used
it together with Flow* to compute the reach sets of the systems [10].

Works in this direction differ in the abstract domain adopted for output
range analysis of the neural network components. Xiang et al. compute the
output ranges as a union of convex polytopes [37]: for piecewise linear systems
with a ReLU neural network as the controller, they compute the output range of
the ReLU network layer by layer. Dutta et al. propose abstracting the DNN by a
local polynomial approximation with a rigorous error bound, and then integrate it
with a Taylor-model-based flowpipe construction scheme for continuous differential
equations to derive an over-approximation of the real reachable set [11]. Likewise,
Huang et al. construct a polynomial approximation of a DNN controller using
Bernstein polynomials, and then integrate the result with the plant to obtain the
over-approximated reachable set [18]. A different route to reachability for systems
with neural network components, proposed by Ivanov et al., is Verisig [19].
It transforms the problem of verifying a neural-network-controlled system into a
hybrid system verification problem by first transforming a sigmoid-based neural
network into an equivalent hybrid system and then composing it with the plant.
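
To make the notion of layer-by-layer output range analysis concrete, the sketch below propagates interval bounds through a small ReLU network. This is plain interval arithmetic, a deliberately simpler and coarser abstract domain than the polytope unions of [37] or the polynomial abstractions of [11,18]; the network weights are made up for illustration.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (weights, biases)


def affine_bounds(weights, biases, inputs: List[Interval]) -> List[Interval]:
    """Bound W x + b when each input x_j ranges over its interval."""
    result = []
    for row, bias in zip(weights, biases):
        low = high = bias
        for weight, (lo, hi) in zip(row, inputs):
            # A positive weight maps lower bounds to lower bounds; a negative one swaps them.
            if weight >= 0:
                low += weight * lo
                high += weight * hi
            else:
                low += weight * hi
                high += weight * lo
        result.append((low, high))
    return result


def relu_bounds(bounds: List[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to both interval endpoints."""
    return [(max(0.0, lo), max(0.0, hi)) for lo, hi in bounds]


def output_range(layers: List[Layer], inputs: List[Interval]) -> List[Interval]:
    """Propagate input intervals through all layers (ReLU on hidden layers only)."""
    bounds = inputs
    for index, (weights, biases) in enumerate(layers):
        bounds = affine_bounds(weights, biases, bounds)
        if index < len(layers) - 1:
            bounds = relu_bounds(bounds)
    return bounds


# Hypothetical 2-2-1 network and input box [-1, 1] x [-1, 1].
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.5]),
    ([[1.0, 1.0]], [0.0]),
]
print(output_range(toy_layers, [(-1.0, 1.0), (-1.0, 1.0)]))
```

The resulting interval over-approximates the true output range; in a closed-loop analysis such a bound on the control input would then be handed to a reachability tool for the plant.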

Instead of computing reachable sets, a different approach to verifying
neural-network-controlled systems is barrier certificate synthesis. Tuncali et
al. synthesize candidate barrier certificates using simulation-guided techniques,
and then verify the overall system safety by checking the validity of the barrier
certificate conditions for the candidate [35]. Either the safety property is proved,
or a counterexample is returned and used to update the candidate barrier certificates.


Safety Critical Controller Generation. Research works in this category 
differ in: (1) the overall learning framework, e.g. reinforcement learning (RL) 
or supervised learning; (2) the kind of safety certificate, e.g., control Lyapunov 
function (CLF) or control barrier function (CBF) [2]. 

For CLF or CBF synthesis, a demonstrator-learner-verifier framework was
proposed in [29] to learn polynomial CLFs for polynomial nonlinear dynamical
systems; a special type of neural network was designed in [30] as candidates for
learning Lyapunov functions; a supervised learning approach was proposed in
[5] to learn neural network Lyapunov functions and linear control policies;
data-driven model predictive control (MPC) exploiting a neural Lyapunov function
and a neural network dynamics model was proposed in [12,25]. For multi-agent
systems, barrier functions have recently been applied to safe policy synthesis on
POMDP models [1]. The computer science community has approached safe controller
learning in other ways: for example, a logical-proof-based approach towards safe
RL was proposed in [15]; a framework capable of synthesizing deterministic
programs from neural network policies was proposed in [41], so that formal
verification techniques for traditional software systems can be applied.
Compared with these works, Zhao et al. [39] learn controllers based
on neural networks. To certify the safety property, they utilize barrier certificates,
which are represented by DNNs as well. In this way, they train DNN controllers
and DNN barrier certificates simultaneously, achieving a verification-in-the-loop
synthesis. Liu et al. proposed a recurrent neural network (RNN) framework
to synthesize feedback control policies for a system under STL specifications
[24]. The CBF was used to modify the control policies predicted by the RNN to
guarantee safety.


7 Conclusion


In this paper, we have developed a novel scheme for synthesizing safe controllers
for nonlinear control systems subject to safety constraints. It employs an
iterative architecture in which a learner trains DNN controllers using reinforcement
learning and a verifier checks them by computing maximal safe initial
regions and the corresponding barrier certificates, based on polynomial
abstraction and bilinear matrix inequality (BMI) solving. The key idea is an
alternating co-synthesis of controllers and barrier certificates, which refines
the barrier certificates during the iteration. On the one hand, this synthesis
scheme inherits the learning efficiency of RL, which is higher than that of other
data-driven methods. On the other hand, the iterative architecture adapts the
barrier certificate as the DNN controller is retrained, whereas other
verification-in-the-loop synthesis methods are usually based on user-defined
barrier functions. Furthermore, our BMI-solving-based barrier certificate
generation is more efficient than SMT-based verification. The experimental results
demonstrate that our method is more scalable and effective than the existing DNN
controller synthesis method nncontroller.


References 


1. Ahmadi, M., Singletary, A., Burdick, J.W., Ames, A.D.: Safe policy synthesis in 
multi-agent POMDPs via discrete-time barrier functions. In: Proceedings of the 
IEEE 58th Conference on Decision and Control (CDC), pp. 4797-4803. IEEE 
(2019) 

2. Ames, A.D., Coogan, S., Egerstedt, M., Notomista, G., Sreenath, K., Tabuada, 
P.: Control barrier functions: theory and applications. In: Proceedings of the 17th 
European Control Conference, (ECC), pp. 3420-3431 (2019) 

3. Barto, A.G., Sutton, R.S., Anderson, C.W.: Neuronlike adaptive elements that can 
solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 13(5), 
834-846 (1983) 

4. Bouissou, O., Chapoutot, A., Djaballah, A., Kieffer, M.: Computation of paramet- 
ric barrier functions for dynamical systems using interval analysis. In: Proceedings 
of the 53rd IEEE Conference on Decision and Control (CDC), pp. 753-758. IEEE 
(2014) 

5. Chang, Y.C., Roohi, N., Gao, S.: Neural Lyapunov control. In: Proceedings of 
the Annual Conference on Advances in Neural Information Processing Systems 
(NeurIPS), pp. 3245-3254 (2019) 

6. Chesi, G.: Computing output feedback controllers to enlarge the domain of attrac- 
tion in polynomial systems. IEEE Trans. Autom. Control 49(10), 1846-1853 (2004) 

7. Davis, P.J.: Interpolation and Approximation. Dover Books on Mathematics. Dover 
Publications, New York (1975) 

8. Deshmukh, J.V., Kapinski, J., Yamaguchi, T., Prokhorov, D.: Learning deep neural 
network controllers for dynamical systems with safety guarantees: Invited paper. 
In: Proceedings of the IEEE/ACM International Conference on Computer-Aided 
Design (ICCAD), pp. 1-7 (2019) 


9. Duchoň, M.: A generalized Bernstein approximation theorem. Tatra Mt. Math.
Publ. 49(1), 99-109 (2011)

10. Dutta, S., Chen, X., Jha, S., Sankaranarayanan, S., Tiwari, A.: Sherlock - a tool
for verification of neural network feedback systems: demo abstract. In: Proceedings
of the 22nd ACM International Conference on Hybrid Systems: Computation and
Control (HSCC), pp. 262-263 (2019)

11. Dutta, S., Chen, X., Sankaranarayanan, S.: Reachability analysis for neural
feedback systems using regressive polynomial rule inference. In: Proceedings of the
22nd ACM International Conference on Hybrid Systems: Computation and
Control (HSCC), pp. 157-168 (2019)

12. Dutta, S., Jha, S., Sankaranarayanan, S., Tiwari, A.: Learning and verification of
feedback control systems using feedforward neural networks. IFAC-PapersOnLine
51(16), 151-156 (2018)

13. Dutta, S., Jha, S., Sankaranarayanan, S., Tiwari, A.: Output range analysis for
deep feedforward neural networks. In: Dutle, A., Muñoz, C., Narkawicz, A. (eds.)
NFM 2018. LNCS, vol. 10811, pp. 121-138. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-77935-5_9

14. Fazlyab, M., Robey, A., Hassani, H., Morari, M., Pappas, G.J.: Efficient and
accurate estimation of Lipschitz constants for deep neural networks. arXiv preprint
arXiv:1906.04893 (2019)

15. Fulton, N., Platzer, A.: Safe reinforcement learning via formal methods: toward safe
control through proof and learning. In: Proceedings of the Thirty-Second AAAI
Conference on Artificial Intelligence (AAAI), pp. 6485-6492 (2018)

16. Gao, S.: Quadcopter model. https://github.com/dreal/benchmarks

17. García, J., Fernández, F.: A comprehensive survey on safe reinforcement
learning. J. Mach. Learn. Res. 16(42), 1437-1480 (2015)

18. Huang, C., Fan, J., Li, W., Chen, X., Zhu, Q.: ReachNN: reachability analysis of
neural-network controlled systems. ACM Trans. Embedded Comput. Syst. 18(5s),
106:1-106:22 (2019)

19. Ivanov, R., Weimer, J., Alur, R., Pappas, G.J., Lee, I.: Verisig: verifying safety
properties of hybrid systems with neural network controllers. In: Proceedings of
the 22nd ACM International Conference on Hybrid Systems: Computation and
Control (HSCC), pp. 169-178 (2019)

20. Jarvis-Wloszek, Z.: Lyapunov based analysis and controller synthesis for
polynomial systems using sum-of-squares optimization. Ph.D. thesis, University of
California (2003)

21. Klipp, E., Herwig, R., Kowald, A., Wierling, C., Lehrach, H.: Systems Biology in
Practice: Concepts, Implementation and Application. Wiley-Blackwell (2005)

22. Kočvara, M., Stingl, M.: PENBMI user’s guide (version 2.0) (2005).
http://www.penopt.com

23. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. In:
Proceedings of the 4th International Conference on Learning Representations (ICLR)
(2016)

24. Liu, W., Mehdipour, N., Belta, C.: Recurrent neural network controllers for signal
temporal logic specifications subject to safety constraints (2020).
https://arxiv.org/abs/2009.11468

25. Mittal, M., Gallieri, M., Quaglino, A., Salehian, S.S.M., Koutník, J.: Neural
Lyapunov model predictive control (2020). https://arxiv.org/abs/2002.10451

26. Prajna, S., Jadbabaie, A., Pappas, G.J.: A framework for worst-case and stochastic
safety verification using barrier certificates. IEEE Trans. Autom. Control 52(8),
1415-1429 (2007)


27. Prajna, S., Parrilo, P.A., Rantzer, A.: Nonlinear control synthesis by convex
optimization. IEEE Trans. Autom. Control 49(2), 310-314 (2004)

28. Pylorof, D., Bakolas, E.: Analysis and synthesis of nonlinear controllers for input
constrained systems using semidefinite programming optimization. In: Proceedings
of the 2016 American Control Conference (ACC), pp. 6959-6964 (2016)

29. Ravanbakhsh, H., Sankaranarayanan, S.: Learning control Lyapunov functions
from counterexamples and demonstrations. Auton. Rob. 43(2), 275-307 (2019)

30. Richards, S.M., Berkenkamp, F., Krause, A.: The Lyapunov neural network:
adaptive stability certification for safe learning of dynamic systems (2018).
http://arxiv.org/abs/1808.00924

31. Ruan, W., Huang, X., Kwiatkowska, M.: Reachability analysis of deep neural
networks with provable guarantees. In: Proceedings of the Twenty-Seventh
International Joint Conference on Artificial Intelligence (IJCAI), pp. 2651-2659 (2018)

32. Sassi, M.A.B., Sankaranarayanan, S.: Stabilization of polynomial dynamical
systems using linear programming based on Bernstein polynomials (2015). arXiv
preprint arXiv:1501.04578

33. Squires, E., Pierpaoli, P., Egerstedt, M.: Constructive barrier certificates with
applications to fixed-wing aircraft collision avoidance. In: Proceedings of the
IEEE Conference on Control Technology and Applications (CCTA), pp. 1656-1661
(2018)

34. Szegedy, C., et al.: Intriguing properties of neural networks. In: Proceedings of the
2nd International Conference on Learning Representations (ICLR) (2014)

35. Tuncali, C.E., Kapinski, J., Ito, H., Deshmukh, J.V.: Reasoning about safety of
learning-enabled components in autonomous cyber-physical systems. In:
Proceedings of the 55th Annual Design Automation Conference (DAC), pp. 30:1-30:6
(2018)

36. Turchetta, M., Kolobov, A., Shah, S., Krause, A., Agarwal, A.: Safe reinforcement
learning via curriculum induction. In: Proceedings of the Annual Conference on
Advances in Neural Information Processing Systems (NeurIPS), pp. 12151-12162
(2020)

37. Xiang, W., Tran, H.D., Rosenfeld, J.A., Johnson, T.T.: Reachable set estimation
and safety verification for piecewise linear systems with neural network controllers.
In: Proceedings of the Annual American Control Conference (ACC), pp. 1574-1579
(2018)

38. Zeng, X., Lin, W., Yang, Z., Chen, X., Wang, L.: Darboux-type barrier certificates
for safety verification of nonlinear hybrid systems. In: Proceedings of the 2016
International Conference on Embedded Software (EMSOFT), pp. 1-10 (2016)

39. Zhao, H., Zeng, X., Chen, T., Liu, Z., Woodcock, J.: Learning safe neural network
controllers with barrier certificates. In: Proceedings of the International
Symposium on the Dependable Software Engineering. Theories, Tools, and Applications
(SETTA), pp. 177-185 (2020)

40. Zhao, H., Zeng, X., Chen, T., Liu, Z., Woodcock, J.: Learning safe neural network
controllers with barrier certificates. Formal Aspects Comput., 1-19 (2021).
https://doi.org/10.1007/s00165-021-00544-5

41. Zhu, H., Xiong, Z., Magill, S., Jagannathan, S.: An inductive synthesis framework
for verifiable reinforcement learning. In: Proceedings of the 40th ACM SIGPLAN
Conference on Programming Language Design and Implementation (PLDI), pp.
686-701 (2019)




Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the 
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the 
material. If material is not included in the chapter’s Creative Commons license and 
your intended use is not permitted by statutory regulation or exceeds the permitted 
use, you will need to obtain permission directly from the copyright holder. 




HYBRIDSYNCHAADL: Modeling and Formal 
Analysis of Virtually Synchronous CPSs in AADL 


Jaehun Lee¹, Sharon Kim¹, Kyungmin Bae¹,
and Peter Csaba Ölveczky²


¹ Pohang University of Science and Technology, Pohang, Korea
kmbae@postech.ac.kr


² University of Oslo, Oslo, Norway


Abstract. We present the HYBRIDSYNCHAADL modeling language and 
formal analysis tool for virtually synchronous cyber-physical systems 
with complex control programs, continuous behaviors, bounded clock 
skews, network delays, and execution times. We leverage the Hybrid 
PALS equivalence, so that it is sufficient to model and verify the simpler 
underlying synchronous designs. We define the HYBRIDSYNCHAADL
language as a sublanguage of the avionics modeling standard AADL
for modeling such designs in AADL, and demonstrate the effectiveness
of HYBRIDSYNCHAADL on a number of applications.


1 Introduction 


Many cyber-physical systems (CPSs) are virtually synchronous networks of hy- 
brid components with continuous behaviors combined with sophisticated con- 
trollers. They should logically behave as if they were synchronous—in each iter- 
ation of the system, all components, in lockstep, read inputs and perform tran- 
sitions which generate outputs for the next iteration—but have to be realized 
in a distributed setting, with clock skews and message passing communication. 
Examples of such CPSs include avionics and automotive systems [34,42], net- 
worked medical devices [5,30], and other distributed control systems such as the 
steam-boiler benchmark [1], where the underlying infrastructure often guaran- 
tees bounds on clock skews, network delays, and local execution times. 

The uptake of automated formal analysis of such CPSs is challenging, since: 


1. The combination of large “discrete” state spaces, caused by interleavings due 
to asynchronous communication, and continuous behaviors, taking into ac- 
count clock skews, network delays, and sampling/actuation times (based on 
imprecise clocks) makes direct automatic model checking analysis infeasible. 

2. To enable formal analysis to a large user base, the modeling language for 
such CPSs, with complex control programs, should be well-known for CPS 
developers, and should be integrated into mature modeling environments. 


To confront these challenges, we present in this paper the HYBRIDSYNCHAADL
modeling language and analysis tool, which address them as follows:


1. To dramatically reduce both the modeling complexity and the state space
caused by asynchronous communication, we use the Hybrid PALS equiva- 
lence [8], which says that the underlying synchronous design—where all com- 
ponents execute in lockstep, and there is no asynchronous message passing— 
satisfies the same properties as the asynchronous distributed system. 

2. The HYBRIDSYNCHAADL modeling language is a subset of the avionics
modeling standard AADL [22] and its behavioral annex to model control
programs, and captures a synchronous subset of AADL with continuous
behaviors. We have also integrated modeling and formal analysis of
HYBRIDSYNCHAADL models into the OSATE modeling environment for AADL.


Providing formal semantics and analysis for HYBRIDSYNCHAADL, with its 
expressive control program formalism, continuous behaviors, and clock skews, 
and having to cover all possible continuous behaviors based on imprecise clocks, 
is challenging. We combine Maude [19] and the SMT solver Yices [21] to provide 
such a semantics, as well as symbolic reachability analysis of bounded invariant 
properties. To make the analysis feasible, our tool also implements a state-space 
reduction method that merges symbolic states for Maude-with-SMT to signifi- 
cantly improve the performance of symbolic reachability analysis. We illustrate 
the use of the HYBRIDSYNCHAADL language and tool—and compare its effec- 
tiveness with other state-of-the-art CPS analysis tools—on a number of hybrid 
CPS applications, including distributed drones that communicate to reach the 
“same” location, or fly in formation, without crashing into each other. 

Our tool extends the SynchAADL tool [7,9,10] for distributed real-time sys- 
tems without continuous behaviors, where the time when an event takes place 
can be abstracted away, so there is no need to consider clock skews, and any 
(sufficiently expressive) explicit-state model checker can be applied. In contrast, 
HYBRIDSYNCHAADL must model continuous behaviors and clock skews, and 
must analyze all possible behaviors based on when the continuous components 
are sampled and actuated, which depend on the imprecise local clocks. The tool 
is available at https://hybridsynchaadl.github.io. 


2 Preliminaries 


PALS and Hybrid PALS. When the infrastructure guarantees bounds $\Gamma$ on clock
skews, network delays, and execution times, the PALS pattern [4,36] reduces
the problems of designing and verifying virtually synchronous distributed
real-time systems to the much simpler problems of designing and verifying their
underlying synchronous designs: Given a synchronous system design $SD$, bounds
$\Gamma$, and a period $p$ of each round, the PALS transformation gives the asynchronous
distributed real-time system $PALS(SD, \Gamma, p)$, which is stuttering bisimilar to $SD$.

The synchronous design SD is formalized as the synchronous composition of
state machines with input and output ports [36]. In each iteration, all machines
simultaneously perform a transition, which includes reading inputs, changing the
local state, and generating outputs (for the next iteration).

Hybrid PALS [8] extends PALS to virtually synchronous CPSs with physical
environments that exhibit continuous behaviors. The physical environment $E_M$
of a machine $M$ has real-valued parameters $\vec{x} = (x_1, \ldots, x_l)$. The continuous
behaviors of $\vec{x}$ are modeled by ordinary differential equations (ODEs) that specify
different trajectories on $\vec{x}$. $E_M$ also defines which trajectory the environment
follows, as a function of the last control command received by $E_M$.

The local clock of a machine $M$ can be seen as a function $c_M : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$,
where $c_M(t)$ is the value of the local clock at time $t$, with $\forall t \in \mathbb{R}_{\ge 0},\ |c_M(t) - t| < \epsilon$
for $\epsilon > 0$ the maximal clock skew [36]. In its $i$th iteration, a controller $M$ samples
the values of its environment at time $c_M(i \cdot p) + t_s$, where $t_s$ is the sampling time,
and then executes a transition. As a result, the new control command is received
by the environment at time $c_M(i \cdot p) + t_a$, where $t_a$ is the actuating time.
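
The effect of the clock-skew bound on these instants can be illustrated with a small calculation. The sketch below assumes only the bound on the local clock stated above, so that the local clock value for round i lies strictly within the maximal skew of i times the period; the function name and the numeric values (in milliseconds) are hypothetical.

```python
def timing_window(iteration, period, offset, max_skew):
    """Open interval (low, high) containing c_M(iteration * period) + offset.

    Uses only the skew bound |c_M(t) - t| < max_skew; the endpoints themselves
    are excluded because the bound is strict.
    """
    nominal = iteration * period + offset
    return (nominal - max_skew, nominal + max_skew)


# Hypothetical parameters in milliseconds: period p, maximal skew, t_s and t_a.
period, max_skew = 100, 10
sampling_offset, actuation_offset = 30, 60

for i in range(3):
    print(i,
          timing_window(i, period, sampling_offset, max_skew),
          timing_window(i, period, actuation_offset, max_skew))
```

An analysis of the synchronous design has to account for every sampling and actuation instant inside these windows, which is why imprecise clocks cannot be abstracted away once continuous behaviors are present.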


AADL. The Architecture Analysis & Design Language (AADL) [22] is an indus- 
trial modeling standard used in avionics, aerospace, automotive, medical devices, 
and robotics to describe an embedded real-time system. In AADL, a component 
type specifies the component’s interface (e.g., ports) and properties (e.g., peri- 
ods), and a component implementation specifies its internal structure as a set of 
subcomponents and a set of connections linking their ports. An AADL construct 
may have properties describing its parameters, declared in property sets. The 
OSATE modeling environment provides a set of Eclipse plug-ins for AADL. 

An AADL model describes a system of hardware and software components. 
Software components include threads that model the application software and 
data components representing data types. System components are the top-level 
components. A port is a data port, an event port, or an event data port. A 
component can have different modes and mode-specific property values, sub- 
components, etc. Mode transitions are triggered by events. 

Thread behavior is modeled as a guarded transition system with local vari- 
ables using AADL’s Behavior Annex [23]. When a thread is activated, transitions 
are applied until a complete state is reached. The dispatch protocol determines 
when a thread is executed. A periodic thread is activated at fixed time intervals. 


Maude with SMT. Maude [19] is a language and tool for formally specifying and
analyzing distributed systems in rewriting logic. System states are specified as
elements of algebraic data types, and transitions are specified using rewrite rules.
In addition to its explicit-state analysis methods for concrete states, Maude
provides SMT solving and symbolic reachability analysis for constrained terms $\phi \parallel t$,
which symbolically represent all instances of the term $t(x_1, \ldots, x_n)$ satisfying the
constraint $\phi(x_1, \ldots, x_n)$ [40], using connections to Yices2 [21] and CVC4 [14].


3 The HYBRIDSYNCHAADL Modeling Language 


This section presents the HYBRIDSYNCHAADL language for modeling virtually
synchronous CPSs in AADL. HYBRIDSYNCHAADL can specify environments
with continuous dynamics, synchronous designs of distributed controllers, and
nontrivial interactions between controllers and environments with respect to
imprecise local clocks and sampling and actuation times.

The HYBRIDSYNCHAADL language is a subset of AADL extended with the
property set Hybrid_SynchAADL. We use a subset of AADL without changing
the meaning of AADL constructs or adding a new annex (the subset has the
same meaning for synchronous models and distributed implementations), so that
AADL experts can easily develop and understand HYBRIDSYNCHAADL models.


property set Hybrid_SynchAADL is
  Synchronous: inherit aadlboolean applies to (system);
  isEnvironment: inherit aadlboolean applies to (system);
  ContinuousDynamics: aadlstring applies to (system);
  Max_Clock_Deviation: inherit Time applies to (system);
  Sampling_Time: inherit Time_Range applies to (system);
  Response_Time: inherit Time_Range applies to (system);
end Hybrid_SynchAADL;


Environment Components. An environment component models real-valued state
variables that continuously change over time. State variables are specified using
data subcomponents of type Base_Types::Float. Each environment component
declares the property Hybrid_SynchAADL::isEnvironment => true.

An environment component can have different modes to specify different 
continuous behaviors (trajectories). A controller command may change the mode 
of the environment or the value of a variable. The continuous dynamics in each 
mode is specified using either ODEs or continuous real functions as follows: 


Hybrid_SynchAADL::ContinuousDynamics =>
  "dynamics_1" in modes (mode_1), ..., "dynamics_n" in modes (mode_n);


In HYBRIDSYNCHAADL, a set of ODEs over $n$ variables $x_1, \ldots, x_n$, say,
$\frac{dx_i}{dt} = e_i(x_1, \ldots, x_n)$ for $i = 1, \ldots, n$, is written as a semicolon-separated string:


d/dt(x1) = e1(x1,...,xn); ... ; d/dt(xn) = en(x1,...,xn);


If a closed-form solution of the ODEs is known, we can directly specify concrete
continuous functions, which are parameterized by a time parameter $t$ and the
initial values $x_1(0), \ldots, x_n(0)$ of the variables $x_1, \ldots, x_n$:


x1(t) = e1(t,x1(0),...,xn(0)); ... ; xn(t) = en(t,x1(0),...,xn(0));
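
To illustrate the two styles, the sketch below renders a set of ODEs in the semicolon-separated form shown above and compares a closed-form solution with a numerical integration of the same ODE. The helper `ode_string`, the example dynamics dx/dt = -0.5 x, and the concrete expression syntax are assumptions made for illustration; they are not taken from the tool.

```python
import math


def ode_string(equations):
    """Render {variable: right-hand side} as 'd/dt(x) = rhs;' entries."""
    return " ".join(f"d/dt({var}) = {rhs};" for var, rhs in equations.items())


# Hypothetical one-variable environment: dx/dt = -0.5 * x.
print(ode_string({"x": "-0.5 * x"}))


def closed_form(t, x0):
    """Closed-form solution x(t) = x(0) * exp(-0.5 t) of the ODE above."""
    return x0 * math.exp(-0.5 * t)


def euler(t_end, x0, steps=100_000):
    """Explicit Euler integration of dx/dt = -0.5 * x, for comparison only."""
    x, dt = x0, t_end / steps
    for _ in range(steps):
        x += dt * (-0.5 * x)
    return x


print(closed_form(2.0, 4.0), euler(2.0, 4.0))
```

When such a closed form is available, specifying it directly spares the analysis from having to reason about the ODE itself.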


An environment component interacts with discrete controllers by sending its
state values, and by receiving actuator commands that may update state
variables or trigger mode (and hence trajectory) changes. This behavior is specified
in HYBRIDSYNCHAADL using connections between ports and data subcomponents.
A connection from a data subcomponent d inside the environment to an
output data port o declares that the value of d is “sampled” by a controller. A
connection from an environment’s input port i to d declares that a controller
command arriving at i updates the value of the data subcomponent d.


Controller Components. Discrete controllers are usual software components in
the Synchronous AADL subset [7,9]. A controller component is specified using
the behavioral and structural subset of AADL: hierarchical system, process,
thread components, and thread behaviors defined by the Behavior Annex [23].

A controller receives the state of the environment at some sampling time, 
and sends a controller command to the environment at some actuation time. 
Sampling and actuation take place according to the local clock of the controller. 


Hybrid_SynchAADL::Max_Clock_Deviation => time;
Hybrid_SynchAADL::Sampling_Time => lower bound .. upper bound;
Hybrid_SynchAADL::Response_Time => lower bound .. upper bound;


The top-level system component declares the following properties to state 
that the entire model is a synchronous design with a period T: 


Hybrid_SynchAADL::Synchronous => true; Period => T;


Communication. In HYBRIDSYNCHAADL, connections are constrained for syn- 
chronous behaviors: no connection is allowed between environments, or between 
environments and the enclosing system components. 

All (non-actuator) outputs of controller components generated in an iteration 
are available to the receiving controller components at the beginning of the next 
iteration. As explained in [7,9], delayed connections between data ports meet 
this requirement. Therefore, two controller components can be connected only 
by data ports with delayed connections: Timing => Delayed. 

Interactions between a controller and an environment occur instantaneously 
at the sampling and actuating times of the controller. Because an environment 
does not “actively” send data for sampling, every output port of an environment 
must be a data port, whereas its input ports could be of any kind. 


4 The HYBRIDSYNCHAADL Tool 


This section introduces the HYBRIDSYNCHAADL tool supporting the modeling 
and formal analysis of HYBRIDSYNCHAADL models. The tool is an OSATE
plugin which: (i) provides an intuitive language to specify properties of models, 
(ii) synthesizes a rewriting logic model from a HYBRIDSYNCHAADL model, and 
(iii) performs various formal analyses using Maude combined with SMT solving. 


Specifying Properties. The tool’s property specification language allows the user 
to specify time-bounded invariant and reachability properties as propositional 
formulas whose atomic propositions are AADL Boolean expressions. 

A “named” atomic proposition can be declared in HYBRIDSYNCHAADL as 
follows, where each identifier is fully qualified with its component path: 


proposition [id]: AADL Boolean Expression 


Fig. 1. Interface of the HYBRIDSYNCHAADL tool.


The following named invariant property holds if, for every (initial) state satisfying
the initial condition $\varphi_{init}$, all states reachable within the time bound
$\tau_{bound}$ satisfy the invariant condition $\varphi_{inv}$.


invariant [name]: $\varphi_{init}$ ==> $\varphi_{inv}$ in time $\tau_{bound}$


A reachability property (the dual of an invariant) holds if a state satisfying
$\varphi_{goal}$ is reachable from some state satisfying $\varphi_{init}$ within the time bound $\tau_{bound}$.


reachability [name]: $\varphi_{init}$ ==> $\varphi_{goal}$ in time $\tau_{bound}$
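
The meaning of the two property forms can be illustrated with a minimal Python sketch. It is not the tool's implementation: it checks the properties over an explicitly enumerated, finite set of timed traces, whereas the tool reasons symbolically over all behaviors. The names `Trace`, `holds_invariant`, and `holds_reachability` are hypothetical, and each trace is assumed to be non-empty with its initial state first.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
# A trace is a time-ordered list of (time, state) pairs; trace[0] is the initial state.
Trace = List[Tuple[float, State]]
Pred = Callable[[State], bool]


def holds_invariant(traces: List[Trace], init: Pred, inv: Pred, bound: float) -> bool:
    """True iff every state up to `bound` on every trace starting in `init` satisfies `inv`."""
    for trace in traces:
        if not init(trace[0][1]):
            continue  # the property only constrains traces that start in an initial state
        if any(t <= bound and not inv(s) for t, s in trace):
            return False  # this trace is a counterexample
    return True


def holds_reachability(traces: List[Trace], init: Pred, goal: Pred, bound: float) -> bool:
    """True iff some trace starting in `init` reaches a `goal` state within `bound`."""
    return any(
        init(trace[0][1]) and any(t <= bound and goal(s) for t, s in trace)
        for trace in traces
    )
```

Under these assumptions the duality is visible directly: `holds_reachability(traces, init, goal, b)` equals `not holds_invariant(traces, init, lambda s: not goal(s), b)`.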


Tool Interface. The tool first statically checks whether a given model is a valid
model that satisfies the syntactic constraints of HYBRIDSYNCHAADL.

HYBRIDSYNCHAADL provides two analysis methods. Symbolic reachability
analysis can verify that all possible behaviors satisfy a given requirement;¹ if not,
a counterexample is generated. Randomized simulation repeatedly executes the
model until a counterexample is found, by randomly choosing concrete sampling
and actuating times, nondeterministic transitions, etc.

Our tool also provides portfolio analysis that combines symbolic reachability 
analysis and randomized simulation. HYBRIDSYNCHAADL runs both methods 
in parallel using multithreading, and displays the result of the analysis that ter- 
minates first. Symbolic analysis can guarantee the absence of a counterexample, 
whereas randomized simulation is effective for finding “obvious” bugs. 
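
The portfolio idea (run both analyses concurrently and report whichever terminates first) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `portfolio` and the callables passed to it are hypothetical stand-ins, not the tool's Java/Maude implementation.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, Tuple


def portfolio(analyses: Dict[str, Callable[[], str]]) -> Tuple[str, str]:
    """Run all analyses in parallel threads; return (name, result) of the first to finish."""
    executor = ThreadPoolExecutor(max_workers=len(analyses))
    futures = {executor.submit(run): name for name, run in analyses.items()}
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the slower analysis; its thread simply runs to completion.
    executor.shutdown(wait=False)
    winner = next(iter(done))
    return futures[winner], winner.result()


def slow_symbolic() -> str:
    time.sleep(0.3)  # stand-in for a long symbolic reachability run
    return "no counterexample"


def fast_random() -> str:
    return "counterexample found"  # stand-in for a quick randomized simulation


if __name__ == "__main__":
    print(portfolio({"symbolic": slow_symbolic, "random": fast_random}))
```

A real portfolio would also terminate the losing analysis; Python threads cannot be killed, so the sketch lets it finish in the background.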

Figure 1 shows the interface of our tool that is fully integrated into OSATE.
The left editor shows the code of FourDronesSystem in Sect. 5, the bottom
right editor shows its graphical representation, and the top right editor shows
two properties in the property specification language. The HYBRIDSYNCHAADL
menu contains three items for constraint checking, code generation, and formal
analysis. The Portfolio Analysis item has already been clicked, and the Result
view at the bottom displays the analysis results in a readable format.


¹ Symbolic analysis currently supports only polynomial continuous dynamics, since
the Yices2 SMT solver does not support general classes of ODEs.

Tool Implementation. We have (in our report [33]) developed a Maude-with-SMT
semantics for HYBRIDSYNCHAADL that formalizes our modeling language
and implements our tool’s analysis commands. Maude is suitable to capture the 
expressive control program language, the hierarchical structure of systems, and 
communication. Symbolic rewriting with SMT allows us to analyze infinite states 
and all possible behaviors caused by sampling and actuation times with imprecise 
clocks, where continuous dynamics can be encoded in SMT [18, 26]. 

Nontrivial control programs with many conditional branches and guarded 
transitions typically involve a large number of symbolic states; to reduce the 
number of results of executing one iteration of the system, we have implemented 
a state merging optimization technique [11] that merges two symbolic states into 
one using disjunction and generalization. As shown in the report [33], this state 
merging dramatically improves the performance of symbolic analysis and makes 
the formal analysis feasible for such distributed hybrid systems. 

The HYBRIDSYNCHAADL tool uses OSATE’s code generation facilities to
synthesize a Maude model from the HYBRIDSYNCHAADL model. It then invokes 
Maude and an SMT solver to check whether the model satisfies given invariant 
and reachability requirements. Our tool is implemented in around 6,200 lines of 
Maude code and around 8,600 lines of Java and Xtend code. 


5 Case Study: Collaborating Autonomous Drones 


This section shows how virtually synchronous CPSs for controlling distributed 
drones—which collaborate to achieve common goals, such as rendezvous and 
formation control—can be modeled and analyzed in HYBRIDSYNCHAADL. 


Rendezvous of Multiple Drones. Consider N drones, where vectors $\vec{x}_i$ and $\vec{v}_i$, for
$1 \le i \le N$, denote the position and velocity of the i-th drone. The continuous
dynamics of the i-th drone is specified by the ordinary differential equation
$\dot{\vec{x}}_i = \vec{v}_i$. The controller samples the drone’s position and velocity, and gives
the new velocity value to the environment as a control command. The goal of
rendezvous is for all drones to arrive near a common location simultaneously.

Figure 2 shows the AADL architecture of our rendezvous model for four
drones. Each drone is connected to two other drones to exchange positions.
A drone component consists of an environment (with the drone’s position and
velocity) and its controller. Figure 3 shows the implementation of the top-level
component, a Drone system component, an Environment system component, and
a thread component for a drone controller in HYBRIDSYNCHAADL.

In each round, the controller obtains the position $\vec{x}$ from its environment at
its sampling time. The position of the connected drone was sent in the previous
round. The controller determines a new velocity to synchronize its movement
with the other drones using a distributed consensus algorithm [39]. The environment
changes its position according to the velocity indicated by its controller,
where the new velocity $\vec{v}$ becomes effective at its actuation time.

Fig. 2. The AADL architecture of four drones (left), and a drone component (right).


system implementation FourDronesSystem.impl
  subcomponents
    dr1: system Drone::Drone.impl;
    dr2: system Drone::Drone.impl;
    dr3: system Drone::Drone.impl;
    dr4: system Drone::Drone.impl;
  connections
    C1: port dr1.oX -> dr2.iX {Timing => Delayed;};
    C2: port dr1.oY -> dr2.iY {Timing => Delayed;};
    C3: port dr2.oX -> dr3.iX {Timing => Delayed;};
    C4: port dr2.oY -> dr3.iY {Timing => Delayed;};
    C5: port dr3.oX -> dr4.iX {Timing => Delayed;};
    C6: port dr3.oY -> dr4.iY {Timing => Delayed;};
    C7: port dr4.oX -> dr1.iX {Timing => Delayed;};
    C8: port dr4.oY -> dr1.iY {Timing => Delayed;};
  properties
    Hybrid_SynchAADL::Synchronous => true;
    Hybrid_SynchAADL::Max_Clock_Deviation => 10ms;
    Period => 100ms;
end FourDronesSystem.impl;


system Drone
  features
    iX: in data port Base_Types::Float;
    iY: in data port Base_Types::Float;
    oX: out data port Base_Types::Float;
    oY: out data port Base_Types::Float;
end Drone;


system implementation Drone.impl
  subcomponents
    ctl: system DroneControl::DroneControl.impl;
    env: system Environment::Environment.impl;
  connections
    C1: port ctl.oX -> oX;       C2: port ctl.oY -> oY;
    C3: port iX -> ctl.iX;       C4: port iY -> ctl.iY;
    C5: port ctl.vX -> env.vX;
    C6: port ctl.vY -> env.vY;
    C7: port env.cX -> ctl.cX;
    C8: port env.cY -> ctl.cY;
  properties
    Hybrid_SynchAADL::Sampling_Time => 2ms .. 4ms;
    Hybrid_SynchAADL::Response_Time => 6ms .. 9ms;
end Drone.impl;


system Environment
  features
    cX: out data port Base_Types::Float;
    cY: out data port Base_Types::Float;
    vX: in data port Base_Types::Float;
    vY: in data port Base_Types::Float;
  properties
    Hybrid_SynchAADL::isEnvironment => true;
end Environment;


system implementation Environment.impl
  subcomponents
    x: data Base_Types::Float;
    y: data Base_Types::Float;
    velx: data Base_Types::Float;
    vely: data Base_Types::Float;
  connections
    C1: port x -> cX;            C2: port y -> cY;
    C3: port vX -> velx;         C4: port vY -> vely;
  properties
    Hybrid_SynchAADL::ContinuousDynamics =>
      "x(t) = 0.001 * velx * t + x(0);
       y(t) = 0.001 * vely * t + y(0);";
end Environment.impl;


thread DroneControlThread
  features
    iX: in data port Base_Types::Float;
    iY: in data port Base_Types::Float;
    oX: out data port Base_Types::Float;
    oY: out data port Base_Types::Float;
    cX: in data port Base_Types::Float;
    cY: in data port Base_Types::Float;
    vX: out data port Base_Types::Float;
    vY: out data port Base_Types::Float;
  properties
    Dispatch_Protocol => Periodic;
end DroneControlThread;


thread implementation DroneControlThread.impl
  subcomponents
    cls: data Base_Types::Boolean;
  annex behavior_specification {**
    variables
      nx: Base_Types::Float;  ny: Base_Types::Float;
    states
      s1: initial complete state;  s2, s3: state;
    transitions
      s1 -[on dispatch]-> s2;
      s2 -[abs(cX - iX) < 0.1 and
           abs(cY - iY) < 0.1]-> s3 {
        vX := 0; vY := 0; cls := true };
      s2 -[otherwise]-> s3 {
        nx := -1 * (cX - iX); ny := -1 * (cY - iY);
        if (nx > 0.3) vX := 2.5
        elsif (nx > 0.15)
          if (cls) vX := 1.5 else vX := 0.0 end if
        else vX := -2.5 end if;
        if (ny > 0.3) vY := 2.5
        elsif (ny > 0.15)
          if (cls) vY := 1.5 else vY := 0.0 end if
        else vY := -2.5 end if; cls := false };
      s3 -[]-> s1 { oX := cX; oY := cY };
  **};
end DroneControlThread.impl;


Fig. 3. A HYBRIDSYNCHAADL model for four distributed drones.
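
To make the behavior in Fig. 3 easier to follow, here is a minimal Python sketch of one controller transition and of the closed-form environment dynamics. It is a transliteration under simplifying assumptions, not generated code: the function names `control_step` and `flow` are hypothetical, and sampling times, actuation times, and clock deviation are ignored.

```python
from typing import Tuple


def control_step(cX: float, cY: float, iX: float, iY: float,
                 cls: bool) -> Tuple[float, float, bool]:
    """One s2 -> s3 transition of the controller thread; returns (vX, vY, cls).

    (cX, cY) is the drone's own sampled position, (iX, iY) the position
    received from the connected drone in the previous round.
    """
    # First guard: the two drones are already close, so stop.
    if abs(cX - iX) < 0.1 and abs(cY - iY) < 0.1:
        return 0.0, 0.0, True

    def velocity(n: float) -> float:
        # Mirrors the if / elsif / else cascade of the behavior annex.
        if n > 0.3:
            return 2.5
        if n > 0.15:
            return 1.5 if cls else 0.0
        return -2.5

    nx, ny = -1 * (cX - iX), -1 * (cY - iY)
    return velocity(nx), velocity(ny), False


def flow(pos0: float, vel: float, t_ms: float) -> float:
    """Closed-form dynamics of one coordinate: pos(t) = 0.001 * vel * t + pos(0)."""
    return 0.001 * vel * t_ms + pos0


if __name__ == "__main__":
    vX, vY, cls = control_step(0.0, 0.0, 1.0, 0.2, True)
    print(vX, vY, cls, flow(0.0, vX, 100.0))
```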

Verification. We analyze the following properties up to bound 500 ms using
HYBRIDSYNCHAADL portfolio analysis: (i) drones do not collide (safety), and
(ii) all drones can eventually gather together (rendezvous).


invariant [safety]: ?initial ==> not ?collision in time 500; 
reachability [rendezvous]: ?initial ==> ?gather in time 500; 


We define three propositions: initial, defining the range of initial positions 
of the four drones dr1, dr2, dr3, and dr4; collision, where two drones collide if 
the (horizontal and vertical) distance between them is less than 0.1; and gather, 
indicating that the distance between each pair of drones is less than 1. For 
example, collision and initial are defined as follows. 


proposition [initial] :
  abs(dr1.env.x - 1.1) < 0.01 and abs(dr1.env.y - 1.5) < 0.01 and
  abs(dr2.env.x + 1.5) < 0.01 and abs(dr2.env.y + 1.1) < 0.01 and
  abs(dr3.env.x - 1.5) < 0.01 and abs(dr3.env.y - 1.1) < 0.01 and
  abs(dr4.env.x + 1.1) < 0.01 and abs(dr4.env.y + 1.5) < 0.01;


proposition [collision] :
  (abs(dr1.env.x - dr2.env.x) < 0.1 and abs(dr1.env.y - dr2.env.y) < 0.1) or
  (abs(dr3.env.x - dr4.env.x) < 0.1 and abs(dr3.env.y - dr4.env.y) < 0.1);
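
As a small illustration of how such a proposition is evaluated on a concrete state, the following Python sketch mirrors the collision proposition above. The names `close` and `collision` and the dictionary layout of positions are assumptions made for this example only.

```python
from typing import Dict, Tuple

Position = Tuple[float, float]  # (x, y) of a drone's environment


def close(a: Position, b: Position, eps: float) -> bool:
    """Both the horizontal and the vertical distance are below eps."""
    return abs(a[0] - b[0]) < eps and abs(a[1] - b[1]) < eps


def collision(pos: Dict[str, Position]) -> bool:
    """Same two pairs as in the proposition shown in the text."""
    return close(pos["dr1"], pos["dr2"], 0.1) or close(pos["dr3"], pos["dr4"], 0.1)


if __name__ == "__main__":
    state = {"dr1": (0.0, 0.0), "dr2": (0.05, 0.05), "dr3": (1.5, 1.1), "dr4": (-1.1, -1.5)}
    print(collision(state))
```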


The analysis result is shown in the Result view at the bottom of Fig. 1. 
There is a witness for rendezvous, obtained by symbolic reachability analysis in 
1.7 seconds. A counterexample for safety is found by randomized simulation in
1.5 seconds, since initial does not constrain the speed of the drones. In [33], 
we add initial velocity constraints, and verify that safety holds up to the time 
bound by symbolic reachability analysis in 15 minutes. 


6 Experimental Evaluation 


We compare the performance of HYBRIDSYNCHAADL’s symbolic analysis with 
four reachability analysis tools for hybrid automata, HyComp [18], SpaceEx [24], 
Flow* [17], and dReach [31], on models of rendezvous and formation control for 
distributed drones, and on networked thermostats (adapted from [6,29]). We use 
simplified models with less complex control; otherwise, most of the other tools 
time out (see [33] for results on more complex models). We use two invariant 
properties for each model: Inv⊤, which holds, and Inv⊥, which does not hold.

To use the other tools, we have “encoded” the synchronous designs of the 
HYBRIDSYNCHAADL models as networks of hybrid automata. Each component 
is modeled as a hybrid automaton with three modes: starting a new round, 
sampling, and controller transition/actuation. The behavior of a controller is 
encoded as single jumps. We use flat hybrid automata (obtained by HYST [12]) 
for Flow* and dReach, which do not support networks of hybrid automata. 

We measure the execution times for analyzing the invariant properties up to
bound 500 ms, with a timeout of 60 minutes. For HYBRIDSYNCHAADL, we use
a specialized version of Maude with Yices 2.6 for polynomial arithmetic [44]. For
SpaceEx, we use PHAVer for linear dynamics, and STC for nonlinear polynomial
dynamics. For Flow*, we use adaptive steps, and TM orders 1 (for single) and
2 (for double). We use the default precision for dReach, and BMC for HyComp.
We have run all experiments on an Intel Xeon 2.8 GHz with 256 GB memory.


Table 1. HYBRIDSYNCHAADL (HSAADL) vs. HyComp, SpaceEx, dReach, and Flow*.

                       Inv⊤                             Inv⊥
                             N=2        N=3        N=4        N=2        N=3        N=4
Model          Tool         Time B     Time B     Time B     Time B     Time B     Time B

Rend (single)  HSAADL        2.0 5      3.9 5      5.8 5      2.4 3      4.2 3      5.9 3
               HyComp        0.8 5      4.0 5     17.2 5      8.9 3     11.5 3    192.6 3
               SpaceEx       8.0 5   2230.3 3      4.5 1      5.1 3   2676.6 3      T/O -
               dReach     1382.7 3    107.1 1      T/O -      T/O -      T/O -      T/O -
               Flow*      3552.8 4   2725.5 2   1205.2 1    167.3 3    380.4 2    838.0 3

Formation      HSAADL        3.0 5      7.3 5      7.9 5     15.5 4      2.5 2      5.2 2
               HyComp       13.3 5     41.3 5    182.1 5      T/O -      2.6 2     20.3 2
               SpaceEx      91.9 2      2.8 1    114.8 1      T/O -      T/O -      T/O -
               dReach      139.0 1      T/O -      T/O -      T/O -      T/O -      T/O -
               Flow*      1464.7 2    873.4 1      T/O -      T/O -     45.3 1    291.3 2

Thermostat     HSAADL        2.7 5      4.7 5      7.8 5      7.6 5     15.3 5     10.7 4
               HyComp        1.6 5      8.5 5     37.9 5      2.6 5     15.5 5     43.1 4
               SpaceEx       2.3 5    696.4 3     34.5 1      2.2 5      T/O -      T/O -
               dReach      341.6 3     57.5 1      T/O -      T/O -      T/O -      T/O -
               Flow*      3196.4 5   1240.7 2    977.7 1     15.5 3   1718.1 4      T/O -

Rend (double)  HSAADL        3.7 4     37.8 4      6.9 4      1.4 2     16.3 2      2.8 2
               SpaceEx    1147.6 3     81.1 1      T/O -     15.2 2      T/O -      T/O -
               dReach     2156.2 3    274.3 1      T/O -      T/O -      T/O -      T/O -
               Flow*       232.5 2    230.1 1      T/O -      2.2 2     25.4 2   2613.8 1

The results are summarized in Table 1, as execution times (seconds) over time
bounds (B · 100 ms), with N the number of components. The results for “Rend
(double)” (rendezvous with double-integrator dynamics, where control input is
given by acceleration instead of velocity) do not include HyComp, which does
not support nonlinear polynomial dynamics. For Inv⊤, Table 1 shows the largest
time bound for which the tool could prove the absence of counterexamples.
Often, tools timed out when trying to verify that Inv⊤ holds up to time bound
500.² For Inv⊥, the table shows the smallest bound for which the tool found
counterexamples.³ As seen, HYBRIDSYNCHAADL outperforms the other tools
in most cases, in particular for complex models with larger N.


² E.g., for “Rend (single)” with N = 4, HYBRIDSYNCHAADL needs 5.8 seconds for
B = 5, whereas SpaceEx needs 4.5 seconds for B = 1 and timed out for B > 1.

³ Flow* occasionally found (spurious) counterexamples at smaller bounds, because of
over-approximation by the Taylor model flowpipe construction.


7 Related Work


Our tool can model check virtually synchronous CPSs with both complex control 
programs and continuous behaviors (and imprecise local clocks, etc.), whereas 
most formal tools are strong at analyzing either discrete or continuous behaviors. 
The latter includes analysis tools for hybrid automata [13,18,24], which do not 
deal well with the “discrete complexity” (e.g., control programs) of CPSs. 

The Hybrid Annex [2,3] allows specifying continuous behaviors in AADL, 
but message delays, clock skews, etc., are not taken into account. Controllers 
are defined in Hybrid CSP instead of in AADL’s convenient Behavioral Annex. 
Another hybrid annex is proposed in [38], and an AADL sublanguage, AADL+, 
in [35]. None of these languages support automated formal correctness analysis. 

Hybrid PALS models with simple controllers are encoded as logical formulas 
and analyzed by dReal in [8]. However, there is no tool support, and so CPSs 
must be manually modeled as SMT formulas in [8]. In contrast, we provide a 
tool for modeling Hybrid PALS models using a well-known modeling standard. 

Our work is also related to almost-synchronous systems, including approxi- 
mate synchrony [20], quasi-synchrony [15, 16,28,32], GALS [27,37], time-triggered 
architectures [41,43], etc. Our method makes it possible to verify such systems 
with continuous behaviors, which are typically not considered in related work. 


8 Concluding Remarks 


We have presented the HYBRIDSYNCHAADL modeling language and formal
analysis tool for modeling and analyzing the synchronous designs—and, by the 
Hybrid PALS equivalence, also the corresponding asynchronous distributed real- 
time system with bounded clock skews, network delays, and execution times— 
of virtually synchronous networks of hybrid systems with potentially complex 
control programs in the modeling standard AADL. Our tool provides randomized 
simulation and symbolic reachability analysis, and is fully integrated into the 
OSATE modeling environment for AADL. We have shown that in most cases, 
HYBRIDSYNCHAADL’s symbolic analysis outperforms state-of-the-art hybrid 
systems reachability analysis tools on a number of distributed hybrid systems. 

Currently, HYBRIDSYNCHAADL’s symbolic analysis is restricted to systems 
with (nonlinear) polynomial continuous dynamics, because the underlying SMT 
solver, Yices2, cannot deal with general classes of ODEs. We should therefore 
integrate Maude with ODE solvers such as dReal [25] and Flow* [17]. 

Abstract. Detection of bottom strongly connected components (BSCC)
in state-transition graphs is an important problem with many applica- 
tions, such as detecting recurrent states in Markov chains or attractors 
in dynamical systems. However, these graphs’ size is often entirely out of 
reach for algorithms using explicit state-space exploration, necessitating 
alternative approaches such as the symbolic one. 

Symbolic methods for BSCC detection often show impressive perfor- 
mance, but can sometimes take a long time to converge in large graphs. 
In this paper, we provide a symbolic state-space reduction method for 
labelled transition systems, called interleaved transition guided reduction 
(ITGR), which aims to alleviate current problems of BSCC detection by 
efficiently identifying large portions of the non-BSCC states. 

We evaluate the suggested heuristic on an extensive collection of 125 
real-world biologically motivated systems. We show that ITGR can eas- 
ily handle all these models while being either the only method to fin- 
ish, or providing at least an order-of-magnitude speedup over existing 
state-of-the-art methods. We then use a set of synthetic benchmarks to 
demonstrate that the technique also consistently scales to graphs with 
more than $2^{1000}$ vertices, which was not possible using previous methods.


Keywords: Bottom SCC · Symbolic algorithm · Boolean network


1 Introduction 


Finding strongly connected components (SCCs) is a basic problem in graph 
theory. It is impractical or even impossible for large graphs to find SCCs using 
explicit depth-first search, motivating the study of symbolic SCCs computation. 
The structure of SCCs in a graph is captured by its quotient graph, obtained 
by collapsing each SCC into a single node. This graph is acyclic, thus defines a 
partial order on the SCCs. Bottom SCCs (BSCCs) are SCCs corresponding to 
leaf nodes in the quotient graph (alternatively referred to as Terminal SCCs). 
Detection of BSCCs is an important problem with many applications. For 
example, in Markov chains and Markov decision processes, the recurrent states 
belong to terminal SCCs [1,11,38]. In LTL model checking, the detection of bottom
SCCs is used during the decomposition of the property automaton to speed
up the model checking procedure [52]. Another example of an application where 
detection of BSCCs is crucial is detecting non-terminating sections of parallel 
programs written in C or C++ [55]. In models of dynamical systems, which are 
of our primary interest, BSCCs correspond to so-called attractors that determine 
the long-term behaviour of the system [43]. Identification of attractors has many 
important applications, including communication protocols [4,47], systems biol- 
ogy [31,40], mathematical physics [26], ecology [54], epidemiology [42], etc. In 
biology, the possibility of reaching a particular phenotype of a living cell is indi- 
cated by the presence of a specific attractor [40]. The knowledge of attractors 
then unlocks the path towards cell control [33], reprogramming [49] and even 
regenerative medicine [17]. Consequently, detection of BSCCs is a fundamen- 
tal task important not only in computer-aided verification but also many other 
disciplines. 

Our motivation to develop a new symbolic approach to find BSCCs comes 
from the need to handle extremely large graphs representing labelled transition 
systems that encode the behaviour of complex real-world concurrent processes. 
In particular, assuming we deal with finite-state systems, such large transition 
systems are typically generated from models encoded in a compact formalism 
such as process calculi, Petri nets, Boolean networks [32,57], their combina- 
tions [6] or other higher-level modelling languages. For such transition systems, 
the limits of general symbolic SCC algorithms also define the limits of realistic 
applications. 

In most cases, the size of a transition system generated from a model is expo- 
nential in the number of concurrently interacting entities. For example, in the 
case of biological systems, the number of entities is typically ranging from several 
hundred to hundreds of thousands. Despite strong simplifications employed at 
the side of models, the size of respective transition systems rarely falls below 10° 
states and is usually much bigger [23,27,44]. Thus, the need to tackle large tran- 
sition systems gives us a solid motivation to revisit the algorithmics for BSCC 
detection. 

In general, it is possible to find all BSCCs as a part of a general SCC 
decomposition algorithm. There is a rich history of research on computing SCCs 
symbolically. An algorithm based on forward and backward reachability performing
$O(n^2)$ symbolic steps was presented by Xie and Beerel in [59]. Bloem
et al. present an improved $O(n \cdot \log n)$ algorithm in [7]. Finally, an $O(n)$ algorithm
was presented by Gentilini et al. in [25]. This bound has been proved to be
tight in [16]. In [16], the authors argue that the algorithm from [25] is optimal
even when considering more fine-grained complexity criteria, like the diameter 
of the graph and the diameter of the individual components. Ciardo et al. [62] 
use the idea of saturation [20] to speed up state exploration when computing 
each SCC in the Xie-Beerel algorithm and compute the transitive closure of the 
transition relation using a novel algorithm based on saturation. 
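
The forward-backward idea behind this line of algorithms can be sketched with explicit Python sets standing in for symbolic state sets. This is only an illustration in the spirit of the approach of Xie and Beerel, not a reproduction of any of the cited algorithms; the names `reach`, `reverse`, and `sccs` are hypothetical, and a real implementation would operate on BDDs rather than enumerated sets.

```python
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]  # vertex -> set of successors


def reach(graph: Graph, sources: Set[int], universe: Set[int]) -> Set[int]:
    """Vertices of `universe` reachable from `sources` without leaving `universe`."""
    seen = set(sources)
    frontier = set(sources)
    while frontier:
        frontier = {t for s in frontier for t in graph.get(s, ()) if t in universe} - seen
        seen |= frontier
    return seen


def reverse(graph: Graph) -> Graph:
    """Graph with all edges flipped, so forward search computes predecessors."""
    rev: Graph = {}
    for s, targets in graph.items():
        rev.setdefault(s, set())
        for t in targets:
            rev.setdefault(t, set()).add(s)
    return rev


def sccs(graph: Graph) -> List[Set[int]]:
    """SCC decomposition: SCC(pivot) = forward(pivot) & backward(pivot), then recurse."""
    vertices = set(graph) | {t for ts in graph.values() for t in ts}
    rev = reverse(graph)
    result: List[Set[int]] = []
    work = [vertices]
    while work:
        universe = work.pop()
        if not universe:
            continue
        pivot = next(iter(universe))
        fwd = reach(graph, {pivot}, universe)
        bwd = reach(rev, {pivot}, universe)
        scc = fwd & bwd
        result.append(scc)
        # Every remaining SCC lies entirely inside one of these three parts.
        work += [fwd - scc, bwd - scc, universe - (fwd | bwd)]
    return result


if __name__ == "__main__":
    print(sccs({1: {2}, 2: {1, 3}, 3: {4}, 4: {3}, 5: {1}}))
```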

Our approach is motivated by the fact that techniques working very well for
full SCC decomposition do not help to sufficiently accelerate BSCC detection.
At the same time, some heuristics, such as saturation, can provide a meaningful
impact even when using simpler algorithms [62]. The key novelty of our method
lies in a heuristic called transition guided reduction that filters the state space by
reflecting the possibility of transitions to appear in BSCCs. This step makes it
possible to remove some states that do not belong to any BSCC, and in that way
reduces the transition system under analysis so that it becomes tractable for the
modified Xie-Beerel algorithm with saturation [62].

To target specific characteristics of transition systems representing dynam- 
ical systems, e.g., those generated by Boolean networks (BNs) [32,57], several 
specialised symbolic SCC decomposition methods have been developed. Since 
systems for our evaluation come primarily from BNs, we also discuss these spe- 
cialised methods here. A BN consists of Boolean variables, each having a Boolean 
update function. Update functions change the state of the variables. The seman- 
tics of a BN is a transition system where the states are the possible valuations 
of the variables, and the transitions are induced by the execution of the update 
functions. Some of the existing algorithms utilise the synchronous update seman- 
tics (updates of all variables executed synchronously) that significantly simplifies 
the problem [24]. However, it is known that synchronous update can produce 
unrealistic behaviour [37,53]. Models with asynchronous update (concurrently 
executed updates of variables) are closer to reproducing the real behaviour [15]. 
For the evaluation of our method, we consider asynchronous BNs. Various spe- 
cialised techniques of BSCC detection have been developed for asynchronous 
BNs, including BDDs [24,46,56, 60], optimisation [34,35], algebraic-based meth- 
ods [29], SAT [28], answer set programming [45], concurrency theory [14], sam- 
pling [61], or network structure decomposition [18,21]. Moreover, detection of 
BSCCs is also present as a necessary step in cell reprogramming [41,49] and cell 
control [2,33] based on BNs. To the best of our knowledge, existing methods 
specialised for asynchronous BNs do not satisfactorily handle huge models (hun- 
dreds of variables and beyond). The best state-of-the-art tools [21,56] are not 
yet able to robustly work with BNs of such size. We believe that the generally 
applicable heuristic we propose in this paper can significantly shift the present 
technology towards massive real-world applications (thousands of variables and 
beyond). 
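
The asynchronous semantics described above can be made concrete with a short Python sketch that enumerates the labelled transitions of a tiny Boolean network, using the updated variable as the transition label. The function name `async_transitions` and the two-variable example network are assumptions for illustration, and the sketch omits updates that do not change the state (conventions on such self-loops vary).

```python
from itertools import product
from typing import Callable, Dict, Set, Tuple

Valuation = Tuple[bool, ...]
Update = Callable[[Dict[str, bool]], bool]


def async_transitions(update: Dict[str, Update]) -> Dict[str, Set[Tuple[Valuation, Valuation]]]:
    """For each variable (label), the transitions obtained by updating only that variable."""
    names = sorted(update)
    trans: Dict[str, Set[Tuple[Valuation, Valuation]]] = {name: set() for name in names}
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        for i, name in enumerate(names):
            new_value = update[name](state)
            if new_value != values[i]:
                successor = values[:i] + (new_value,) + values[i + 1:]
                trans[name].add((values, successor))
    return trans


if __name__ == "__main__":
    # Hypothetical network: a copies b, and b copies a.
    network = {"a": lambda s: s["b"], "b": lambda s: s["a"]}
    for label, relation in async_transitions(network).items():
        print(label, sorted(relation))
```

In this example the states (False, False) and (True, True) have no outgoing transitions, so each of them forms a single-state BSCC.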

The main contribution of this paper is a novel symbolic method for BSCCs 
detection in state-transition graphs of huge labelled transition systems for which 
the problem cannot be handled by existing algorithms. We introduce a novel 
reduction technique, called interleaved transition guided reduction (ITGR), 
which aims to enable the use of existing methods by removing large portions 
of the irrelevant non-BSCC states. The method relies on the observation that 
BSCCs in real-world systems rarely employ all transition labels available. There- 
fore, if a state s can fire a transition with a label that is not employed by some 
BSCC reachable from s, after applying ITGR, s is eliminated. As a result, all 
paths in the remaining state space only perform transitions with labels employed 
by their reachable BSCCs. What makes the method truly competitive is the
interleaving of multiple processes, each of which performs the reduction for 
a different transition label. The completion of faster processes speeds up the 
remaining parts of the computation, which would be otherwise intractable. 

To show the real-world benefits of our method, we use a wide collection of 
models and compare the prototype implementation of ITGR to the state-of- 
the-art tool CABEAN [56] as well as to our own implementation of the Xie- 
Beerel BSCC detection algorithm [62]. In particular, we consider a set of 125 
real-world asynchronous BNs selected from publicly available Boolean network 
repositories, and show that ITGR can easily handle all these models, while either 
being the only method to finish the computation or providing at least an order- 
of-magnitude speedup over existing methods. Additionally, we analyse a set of 
200 even larger but structurally similar synthetic BNs, which generate transition 
systems with approximately $2^{1000}$ states. We show that ITGR is the only method
that can consistently handle systems of this magnitude. 


2 Preliminaries 


To represent a wide variety of large discrete systems, we consider the abstraction 
generally known as labelled transition systems: 


Definition 1. Let $\mathcal{L}$ be a non-empty set of labels. A labelled transition system
over $\mathcal{L}$ is a pair $T = (S, \{\xrightarrow{a} \mid a \in \mathcal{L}\})$, where $S$ is a finite non-empty set of
states, and for each $a \in \mathcal{L}$, $\xrightarrow{a} \subseteq S \times S$ is a transition relation.


When $(s, s') \in \xrightarrow{a}$, we write $s \xrightarrow{a} s'$, and when $(s, s') \in \xrightarrow{a}$ for some
$a \in \mathcal{L}$, we simply write $s \to s'$. When there is a path $s_1 \to s_2 \to \ldots \to s_n$,
we write $s_1 \to^* s_n$. Each labelled transition system $T$ can be seen as a directed
state-transition graph $G_T = (S, E)$, whose vertices are the states of $T$ and whose
edges are given by the transition relations, i.e. $(s, s') \in E \Leftrightarrow s \to s'$. This
formalism can naturally describe a wide variety of modelling frameworks with
built-in nondeterminism, such as Petri nets, Boolean networks, or multi-valued
regulatory networks.

We assume that we have a symbolic representation of a labelled transition system 
that allows us to perform symbolic set operations on the subsets of $S$ (union $\cup$, 
intersection $\cap$, difference $\setminus$, subset test $\subseteq$, pick an element PICK, etc.) as well as 
apply the following operations using the associated transition relations: 


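$\mathrm{POST}(a, X) = \{ s' \in S \mid \exists s \in X.\ s \xrightarrow{a} s' \}$ 
$\mathrm{PRE}(a, X) = \{ s \in S \mid \exists s' \in X.\ s \xrightarrow{a} s' \}$ 
$\mathrm{CANPOST}(a, X) = \{ s \in X \mid \exists s' \in S.\ s \xrightarrow{a} s' \}$ 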
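
The three operations can be illustrated on explicit Python sets. This is only a sketch: the paper assumes a symbolic representation (such as BDDs), whereas here a transition system is assumed to be a plain dictionary from labels to sets of state pairs, and all function names are illustrative.

```python
from typing import Dict, Hashable, Set, Tuple

State = Hashable
# Assumption: each label maps to an explicit set of (source, target) pairs.
Transitions = Dict[str, Set[Tuple[State, State]]]


def post(trans: Transitions, a: str, xs: Set[State]) -> Set[State]:
    """States reachable from xs by one a-labelled transition."""
    return {t for (s, t) in trans[a] if s in xs}


def pre(trans: Transitions, a: str, xs: Set[State]) -> Set[State]:
    """States that reach xs by one a-labelled transition."""
    return {s for (s, t) in trans[a] if t in xs}


def can_post(trans: Transitions, a: str, xs: Set[State]) -> Set[State]:
    """States of xs in which an a-labelled transition is enabled."""
    return {s for (s, _) in trans[a] if s in xs}


def all_post(trans: Transitions, xs: Set[State]) -> Set[State]:
    """Union of post over all labels; needs one post call per label."""
    result: Set[State] = set()
    for a in trans:
        result |= post(trans, a, xs)
    return result


example: Transitions = {"a": {(0, 1), (1, 2)}, "b": {(2, 0), (2, 3)}}
print(post(example, "a", {0, 1}), can_post(example, "a", {0, 2, 3}))
```

The loop in `all_post` mirrors why the aggregated operations are charged one symbolic step per label rather than a single step.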


We further assume a symbolic complexity model where the complexity of 
each such operation is in $O(1)$. Additionally, we use the notation 
$\mathrm{ALLPOST}(X) = \bigcup_{a \in \mathcal{L}} \mathrm{POST}(a, X)$ for all successors and $\mathrm{ALLPRE}(X) = \bigcup_{a \in \mathcal{L}} \mathrm{PRE}(a, X)$ for all 


Computing Bottom SCCs Symbolically Using Transition Guided Reduction 509 


predecessors. However, the symbolic complexity of these operations is in $O(|\mathcal{L}|)$. 
Finally, we assume that the labels in $\mathcal{L}$ are sorted based on the order in which 
they influence the variables in the symbolic representation (as in, for example, 
an ordered binary decision diagram [10]). As a consequence, we index the labels 
and write $\mathcal{L} = \{a_1, \dots, a_{|\mathcal{L}|}\}$. 

Now let us recall a few basic definitions from graph theory in order to define 
the BSCC detection problem for labelled transition systems: 


Definition 2. Let $G = (V, E)$ be a directed graph. A subset $C \subseteq V$ is a strongly 
connected component (SCC) of $G$ iff it is a maximal subset such that for all pairs 
$v, v' \in C$, there is a path from $v$ to $v'$. 

A strongly connected component $C$ is called bottom (or terminal, BSCC in 
the following) when there is no edge going from any $v \in C$ to any $v' \in V \setminus C$. 


For a given $s \in V$, we write $\mathrm{SCC}(s)$ to denote the strongly connected component 
that contains $s$. Furthermore, we say that a set $X$ is SCC-closed when 
$X = \bigcup_{x \in X} \mathrm{SCC}(x)$. This means that every SCC of $G$ is either included in $X$, 
or is completely disjoint from $X$. As an example, the set of all vertices reachable 
from any given initial set is SCC-closed. When $|\mathrm{SCC}(x)| = 1$ and $(x, x) \notin E$, 
the SCC is called trivial. 

For a set $X \subseteq V$, we sometimes use the term the basin of $X$ to denote the 
set of all the vertices that have a path to a state in $X$, formally 
$\mathrm{basin}(X) = \{u \mid \exists v \in X : u \to^* v\}$. Note that although the name is motivated by the notion 
of attractor basins in dynamical systems, we use it in a more generalised form 
here, i.e. we do not require $X$ to be a BSCC. 


Problem 1. Let T be a labelled transition system and Gr its corresponding 
state-transition graph. The problem of bottom strongly connected component 
detection (BSCC detection) is to identify all subsets of S that correspond to the 
bottom strongly connected components of Gr. 


A detailed analysis of the optimal symbolic asymptotic complexity of a full SCC 
decomposition can be found in [16]. The authors of [16] use a slightly 
different complexity model, in which operations like ALLPOST and ALLPRE are also 
assumed to have $O(1)$ complexity. Nevertheless, their observations about the relationship 
between problem complexity and graph (or component) diameter remain very relevant. 


3 Basic Symbolic BSCC Detection 


First, we discuss a BSCC detection algorithm from [62], which will form our 
baseline going forward. In [62], the authors discuss several symbolic approaches 
to SCC and BSCC computation, as well as fair cycle detection using various 
symbolic algorithms, and compare them on large systems from computer science. 

In particular, the paper points out that more complex approaches, like the 
lock-step method [7], work well for full SCC decomposition but do not bring much 
benefit to the detection of BSCCs. However, the authors do highlight the benefits 
Algorithm 1: Basic BSCC detection algorithm with saturation. 

 1  Function BSCC(universe ⊆ S) 
 2      while universe ≠ ∅ do 
 3          pivot ← PICK(universe); 
 4          basin ← BWD*({pivot}, universe); 
 5          forward ← {pivot}; 
 6          repeat 
 7              (fixpoint, forward) ← FWD(forward, universe); 
 8          until fixpoint or forward ⊈ basin; 
 9          if forward ⊆ basin then 
10              OUTPUT(forward); 
11          universe ← universe \ basin; 

12  Function BWD*(reachable, universe) 
13      repeat 
14          (fixpoint, reachable) ← BWD(reachable, universe); 
15      until fixpoint; 
16      return reachable; 

17  Function BWD(reachable, universe) 
18      for i ∈ |ℒ| ... 1 do 
19          pre ← universe ∩ PRE(a_i, reachable); 
20          if pre ⊈ reachable then 
21              return (false, reachable ∪ pre); 
22      return (true, reachable); 


of basin saturation [19] as a heuristic to speed up the state space search. What 
we present here is therefore the Xie-Beerel algorithm [59], adapted to BSCC 
detection with saturation, based on the notes from [62] (we have rewritten the 
pseudocode to better match the presentation style and background of this paper, 
though). 

The method is summarised in Algorithm 1, which shows the main procedure 
(BSCC) as well as the reachability procedures BWD and BWD*, which we also 
use in the later sections. We omit the pseudocode for FWD and FWD*, as they 
are identical to the BWD case, only swapping PRE for POST. 


Reachability and Saturation. The forward and backward reachability procedures 
are divided into two methods each: FWD, BWD, FWD* and BWD*. Since 
they are functionally symmetrical, we only explicitly discuss backward reachability, 
with everything directly translating to forward reachability as well. 

BWD performs a single backward reachability step and returns the new set 
of states together with an indication of whether a fixed point has been reached 
(i.e. whether no new states have been discovered). Note that in classical saturation, 
once BWD selects some $a_i$, it is typically applied repeatedly. However, 
for our primary application domain (Boolean networks), multiple subsequent 
applications of a single transition would not yield any benefit; we thus use this 
observation to simplify the pseudocode. In other cases, the recommended approach 
is to follow [19]. 

BWD* then simply wraps BWD into a cycle that actually computes the full 
fixed point of the reachable set. This separation into two sub-procedures allows 
us to perform reachability step-by-step or even interleave multiple reachability 
procedures (which will come into play later). Remember that for saturation to 
work well, the ordering of labels needs to follow the ordering of variables in the 
symbolic representation. 
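
The split between the single-step procedure and its fixed-point wrapper can be sketched as follows. This is a hedged illustration on explicit sets, not the symbolic implementation: the dictionary-based transition encoding and the names `bwd` and `bwd_star` are assumptions made for this example, and `labels` is assumed to be already ordered as the text requires.

```python
def pre(trans, a, xs):
    """States with an a-labelled transition into xs (explicit-set stand-in for PRE)."""
    return {s for (s, t) in trans[a] if t in xs}


def bwd(trans, labels, reachable, universe):
    """One backward step: scan labels from last to first and apply the
    first one that discovers a new state. Returns (fixpoint, new_set)."""
    for a in reversed(labels):
        found = universe & pre(trans, a, reachable)
        if not found <= reachable:
            return False, reachable | found
    return True, reachable


def bwd_star(trans, labels, reachable, universe):
    """Repeat single steps until no label adds anything."""
    fixpoint = False
    while not fixpoint:
        fixpoint, reachable = bwd(trans, labels, reachable, universe)
    return reachable


trans = {"a": {(0, 1)}, "b": {(1, 2)}, "c": {(3, 3)}}
print(bwd_star(trans, ["a", "b", "c"], {2}, {0, 1, 2, 3}))
```

Because `bwd` returns after the first productive label, a caller can stop between steps or alternate between several searches, which is the property the interleaved variant relies on.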


Xie-Beerel Algorithm. The main algorithm relies on the well-known observa- 
tion that for a fixed pivot vertex, the SCC of this vertex can be computed as 
the intersection of vertices forward and backward reachable from pivot. When 
searching for BSCCs, we can easily extend this with two extra observations: 
First, pivot is in a BSCC when only the SCC itself is forward-reachable from 
pivot. Second, a vertex backward-reachable from pivot is either in the same 
SCC as pivot (in which case it is in a BSCC iff pivot is in a BSCC), or it is 
not in a BSCC. 

Based on these two extra observations, the original algorithm is modified in 
two ways: First, not just the SCC around pivot, but all backward-reachable 
vertices are eliminated at the end of each iteration. Second, the backward reach- 
ability from pivot is computed in full, as these are the vertices we can eliminate. 
However, the forward reachability is terminated early if it leaves the backward- 
reachable set, since this implies that pivot does not belong to a BSCC. 
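
The modified Xie-Beerel loop described above can be sketched on explicit sets. This is an illustration under stated assumptions, not the authors' implementation: transitions are a dictionary from labels to sets of pairs, `min` stands in for PICK (any element would do), and the helper names are hypothetical.

```python
def post(trans, a, xs):
    return {t for (s, t) in trans[a] if s in xs}


def pre(trans, a, xs):
    return {s for (s, t) in trans[a] if t in xs}


def step(trans, labels, image, reachable, universe):
    """One reachability step; image is post (forward) or pre (backward)."""
    for a in reversed(labels):
        found = universe & image(trans, a, reachable)
        if not found <= reachable:
            return False, reachable | found
    return True, reachable


def bscc(trans, labels, universe):
    """Return the bottom SCCs found inside universe."""
    universe = set(universe)
    result = []
    while universe:
        pivot = min(universe)  # assumption: PICK may return any element
        # Backward reachability is always computed in full.
        basin, fixpoint = {pivot}, False
        while not fixpoint:
            fixpoint, basin = step(trans, labels, pre, basin, universe)
        # Forward reachability stops early once it leaves the basin.
        forward, fixpoint = {pivot}, False
        while not fixpoint and forward <= basin:
            fixpoint, forward = step(trans, labels, post, forward, universe)
        if forward <= basin:
            result.append(frozenset(forward))
        universe -= basin  # everything that reaches pivot is resolved
    return result


trans = {"a": {(0, 1), (1, 0)}, "b": {(1, 2)}, "c": {(2, 3), (3, 2)}, "d": {(4, 4)}}
print(bscc(trans, list(trans), {0, 1, 2, 3, 4}))
```

In this example the first pivot (state 0) lies in a non-bottom SCC, so its forward search escapes the basin and the whole basin is discarded without output.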

The asymptotic complexity of this algorithm (in terms of symbolic operations) 
is $O(|\mathcal{L}| \cdot |S|)$, which follows from the fact that every vertex will appear in 
basin exactly once but may need $O(|\mathcal{L}|)$ operations to be discovered. Note that 
optimal symbolic algorithms for BSCC detection are expected to have linear 
asymptotic complexity. That is, however, assuming a model where ALLPOST is 
an $O(1)$ operation, not $O(|\mathcal{L}|)$. This may be reasonable for some (in particular 
synchronous) systems, but as demonstrated in [19], saturation is typically more 
effective in practice, even though it is not asymptotically optimal in this model. 

In [62], the authors show very impressive performance numbers for this simple 
algorithm. However, there are two drawbacks, which we believe can be improved 
significantly. And as we demonstrate in the evaluation, while powerful, this algo- 
rithm certainly has limits on some real-world models. 

First, the performance of this method is directly tied to the selection of the 
pivot vertex. If the BSCCs of the graph are relatively small, the probability 
of picking a right pivot is also tiny (remember, even an SCC with $2^{10}$ vertices 
is only a minuscule fraction of a graph with $2^{1000}$ vertices). As a consequence, 
the algorithm may require a lot of pivots to explore the entire graph. Second, 
the overall complexity is limited by the diameter of the whole graph instead of 
the diameter of the BSCCs. Even if the pivot is picked perfectly, the algorithm 
still has to explore each BSCC’s whole basin sequentially. To some extent, this 
is inevitable; however, as we hope to demonstrate in the next section, it is not 
always necessary. 
Algorithm 2: Core reduction principle 

 1  Function REDUCE(pivots, universe) 
 2      forward ← FWD*(pivots, universe); 
 3      extendedComponent ← BWD*(pivots, forward); 
 4      bottom ← forward \ extendedComponent; 
 5      if universe ≠ forward then 
 6          basin ← BWD*(forward, universe); 
 7          universe ← universe \ (basin \ forward); 
 8      if bottom ≠ ∅ then 
 9          basin ← BWD*(bottom, universe); 
10          universe ← universe \ (basin \ bottom); 
11      return universe; 


To sum up, Algorithm 1 is a powerful tool for the detection of BSCCs. How- 
ever, it performs best in graphs where the BSCCs either form a large portion 
of the state space or have basins of small diameter, allowing the algorithm to 
converge quickly. 


4 Transition Guided Reduction 


Fig. 1. Example of transition guided reduction. Square nodes show the pivots set used 
for this reduction (in this case, the states that can fire transition a). Double-drawn 
states are the BSCCs of the graph. The green area then shows the extendedComponent 
induced by the two a transitions, and the blue area is the bottom set. The striped 
states are the basins of the two sets, which are eliminated in this reduction. (Color 
figure online) 


In this section, we introduce a technique that we call transition guided reduction 
(TGR) to eliminate a large portion of non-BSCC states. Algorithm 1 can then 
perform much better on this reduced state space. 
We present the technique in two steps: First, in Algorithm 2, we present the 
core principle of the reduction procedure and prove its correctness. This approach 
is generally applicable to any directed graph. Then in Algorithm 3 we show how 
to apply Algorithm 2 in the context of a labelled transition system. Here, we 
can exploit the knowledge of the transition labels to guide the reduction. 

The reduction principle is described in Algorithm 2 and illustrated in Fig. 1. 
Given a set of pivot states and the current universe of all considered states, the 
method starts by computing forward—the set of all states reachable from the 
pivot states. Using this forward set, we then compute the extendedComponent 
of the given pivot states. Formally, an extended component of set $X$ is a subset 
$X' \subseteq S$ that contains all states from $X$, as well as all paths between the states in 
$X$. It is a superset of the union $\bigcup_{x \in X} \mathrm{SCC}(x)$ but also contains all paths (and 
SCCs on these paths) that lead between the elements of this union. 

We can observe the following properties: 


— The forward set is SCC-closed, as it is the result of a reachability procedure. 
Thus any state that can reach but is not contained in forward is not a part 
of any BSCC. 

— The set bottom (i.e. forward \ extendedComponent) is also SCC-closed (as it 
is the difference of two SCC-closed sets). Notice that if this set is not empty, 
it must contain at least one BSCC, and also that any state that can reach 
bottom but is not contained in it is necessarily not a part of any BSCC. 


The algorithm then computes the two sets of states that definitely do not 
contain a BSCC according to these observations and discards these sets from 
the state space. This is done on lines 5-7 and 8-10, respectively. 
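
The reduction principle can be sketched for a plain directed graph given as a set of edges. This is only an illustrative explicit-set version; the names `reachable` and `reduce_universe` and the edge-set encoding are assumptions made for the example.

```python
def reachable(edges, start, universe, backward=False):
    """States of universe reachable from start (forward or backward)."""
    result = set(start) & set(universe)
    while True:
        if backward:
            found = {s for (s, t) in edges if t in result and s in universe}
        else:
            found = {t for (s, t) in edges if s in result and t in universe}
        if found <= result:
            return result
        result |= found


def reduce_universe(edges, pivots, universe):
    """Discard states that provably lie outside every bottom SCC."""
    universe = set(universe)
    forward = reachable(edges, pivots, universe)
    extended = reachable(edges, pivots, forward, backward=True)
    bottom = forward - extended
    if universe != forward:
        # States that can reach forward without being in it.
        basin = reachable(edges, forward, universe, backward=True)
        universe -= basin - forward
    if bottom:
        # States that can reach bottom without being in it.
        basin = reachable(edges, bottom, universe, backward=True)
        universe -= basin - bottom
    return universe


# 0 -> 1 <-> 2 -> 3 (self-loop), 4 -> 0, and an isolated self-loop on 5.
edges = {(0, 1), (1, 2), (2, 1), (2, 3), (3, 3), (4, 0), (5, 5)}
print(reduce_universe(edges, {1}, {0, 1, 2, 3, 4, 5}))
```

With pivot 1, the first branch removes states 0 and 4 and the second removes the non-bottom component {1, 2}; state 5 survives because it is unrelated to the pivot.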

Now we can formulate the following theorem: 


Theorem 1. If state $s \in$ universe is discarded by Algorithm 2, then it is not 
part of any BSCC. 


Proof. The proof follows from the two previous observations. If the state $s$ is 
removed on line 7, it means $s$ can reach a state $s' \in$ forward. Since the set 
forward is SCC-closed, we get $\mathrm{SCC}(s) \neq \mathrm{SCC}(s')$. State $s$ therefore does not 
belong to a BSCC. 

Similarly, if the state $s$ is removed on line 10, then it means $s$ can reach 
a state $s' \in$ bottom, and again due to the fact that bottom is SCC-closed, the 
state $s$ does not belong to a BSCC. 


However, this does not provide any guidance as to which pivots we should 
select for the reduction or why. This is addressed in Algorithm 3. Here, we go 
through all the available transitions $a \in \mathcal{L}$ and select as the pivots the set of all 
the states that can fire $a$ (notice that pivots in Fig. 1 also correspond to such 
states). As a result, all BSCCs that use $a$ are contained in extendedComponent, 
and all BSCCs that do not use $a$, but in whose basin $a$ is performed, are contained 
in the bottom set. This effectively separates the BSCCs based on the transitions 
that they use. 
Algorithm 3: Transition Guided Reduction 

 1  Function TGR(universe ⊆ S) 
 2      for a ∈ ℒ do 
 3          universe ← REDUCE(CANPOST(a, universe), universe); 
 4          if CANPOST(a, universe) = ∅ then 
 5              ℒ ← ℒ \ {a}; 
 6      return universe; 


Finally, notice that if all transitions with a certain label $a$ are eliminated, we 
remove $a$ from $\mathcal{L}$ completely. In large systems, this can significantly reduce the 
overhead of the FWD/BWD procedures that have to iterate through $\mathcal{L}$. 
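
A minimal explicit-set sketch of the label-driven loop follows. It assumes transitions stored as a dictionary from labels to sets of pairs; all function names are illustrative and the reachability helper is a naive stand-in for the symbolic procedures.

```python
def reachable(trans, labels, start, universe, backward=False):
    """States of universe reachable from start using only the given labels."""
    result = set(start) & universe
    while True:
        found = set()
        for a in labels:
            for (s, t) in trans[a]:
                if backward and t in result and s in universe:
                    found.add(s)
                if not backward and s in result and t in universe:
                    found.add(t)
        if found <= result:
            return result
        result |= found


def reduce_universe(trans, labels, pivots, universe):
    forward = reachable(trans, labels, pivots, universe)
    extended = reachable(trans, labels, pivots, forward, backward=True)
    bottom = forward - extended
    if universe != forward:
        basin = reachable(trans, labels, forward, universe, backward=True)
        universe = universe - (basin - forward)
    if bottom:
        basin = reachable(trans, labels, bottom, universe, backward=True)
        universe = universe - (basin - bottom)
    return universe


def can_post(trans, a, xs):
    return {s for (s, _) in trans[a] if s in xs}


def tgr(trans, universe):
    """Run one reduction per label; drop labels that can no longer fire."""
    universe = set(universe)
    labels = list(trans)
    for a in list(labels):
        universe = reduce_universe(trans, labels, can_post(trans, a, universe), universe)
        if not can_post(trans, a, universe):
            labels.remove(a)  # later reachability calls skip this label
    return universe


trans = {"a": {(0, 1)}, "b": {(1, 2), (2, 1)}, "c": {(3, 3)}}
print(tgr(trans, {0, 1, 2, 3}))
```

Here label `a` is enabled only in state 0, which is discarded in the first iteration, so `a` is dropped from the label list before the remaining reductions run.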

To better describe the cases in which this reduction works well, let us first 
formally define the following: 


Definition 3. Given a labelled transition system $T$ and a state $s$, the fire set 
$F(s)$ is the subset of transition labels $F(s) \subseteq \mathcal{L}$ that can be fired in state $s$, i.e. 
$a \in F(s) \Leftrightarrow \exists s' \in S : s \xrightarrow{a} s'$. A transitive fire set $F^*(s)$ is the union of all the 
fire sets $F(s')$ of all the states $s'$ reachable from $s$ (i.e. $s \to^* s'$). 


Notice that for any two states such that $s \to^* s'$, it holds that $F^*(s') \subseteq F^*(s)$. 
This also means that in any SCC, the transitive fire set of all states is the same. 
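
Both notions are easy to compute on a small explicit example. The sketch below assumes the same dictionary-of-pairs encoding as before and treats every state as reachable from itself; the function names are illustrative.

```python
def fire_set(trans, s):
    """Labels that have at least one transition leaving state s."""
    return {a for a, rel in trans.items() if any(src == s for (src, _) in rel)}


def transitive_fire_set(trans, s):
    """Union of fire sets over all states reachable from s (including s)."""
    seen, stack = {s}, [s]
    while stack:
        current = stack.pop()
        for rel in trans.values():
            for (src, dst) in rel:
                if src == current and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
    labels = set()
    for state in seen:
        labels |= fire_set(trans, state)
    return labels


trans = {"a": {(0, 1)}, "b": {(1, 2), (2, 1)}}
print(fire_set(trans, 0), transitive_fire_set(trans, 0), transitive_fire_set(trans, 1))
```

In this example state 0 reaches the component {1, 2}, so its transitive fire set strictly contains that of states 1 and 2, which share the same set because they lie in one SCC.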


Theorem 2. Let $s$ be an arbitrary state and $s'$ be a state of a BSCC such that 
$s \to^* s'$. If $F^*(s) \neq F^*(s')$, Algorithm 3 discards the state $s$. 


Proof. Since $s \to^* s'$ and $F^*(s) \neq F^*(s')$, it follows that $F^*(s') \subsetneq F^*(s)$. 
Let $a \in F^*(s) \setminus F^*(s')$ be arbitrary and let us consider the iteration of the 
main loop when $a$ is selected. Assume that $s$ has not been discarded in any of 
the previous iterations (otherwise, the proof is already finished). Let $E$ be the 
extendedComponent computed in the current iteration. Then $s$ is either in $E$, 
or $s$ can reach $E$, because $a \in F^*(s)$. 

If $s \notin E$, but $s$ can reach $E$, then $s$ is eliminated on line 7 of Algorithm 2 
(called from Algorithm 3) as part of the forward basin. When $s \in E$, it holds that 
$s' \in$ bottom, since $a \notin F^*(s')$. However, since $s \to^* s'$, we know that $s$ is 
removed on line 10 because it belongs to the basin of the bottom set. 


However, note that the other implication does not hold. That is, these are 
not the only states that Algorithm 3 eliminates (this can be also seen in Fig. 1). 

Based on this theorem, we can derive two extra observations which help to 
explain the effectiveness of the reduction: 


Corollary 1. If a transition system T has a trivial BSCC, then the whole basin 
of this SCC is discarded by Algorithm 3. 


Corollary 2. If a state $s$ is not discarded by Algorithm 3, then all paths starting 
in $s$ in the reduced state space only use the same transitions as contained in 
BSCCs reachable from $s$. 
The first corollary follows from the fact that $F^*(s) = \emptyset$ iff $s$ is a trivial BSCC. 
The second corollary is essentially a rephrasing of Theorem 2, but it highlights 
an important property of the reduction: if some transition label is not used by 
a BSCC, all states in its basin that use it will be eliminated. In our experience, 
real-world systems rarely use all available labels in all BSCCs (unless most of the 
state space is just a single large BSCC). Thus by using this pre-processing step, 
we can greatly simplify the work of Algorithm 1 by pruning “easily identifiable” 
non-BSCC states. 

There is one more point to be made here: Algorithm 1 has to walk the entire 
depth of the BSCC basins, which can be substantial. Meanwhile, our approach 
can often “skip ahead” because it is not starting from a single pivot but rather 
a larger subset of states. However, this may not always be sufficient. In practice, 
some transitions may be much harder to reduce than others. We address this 
problem in the next section. 


5 Interleaved Transition Guided Reduction 


While TGR can significantly reduce the number of states Algorithm 1 needs to 
consider, TGR itself iterates transitions in an arbitrary order which can signifi- 
cantly influence the speed and number of steps the reduction needs to perform. 
Removing a transition potentially reduces the number of states which subse- 
quent reductions need to consider. It is thus beneficial to perform the “easiest” 
reductions first, as this can greatly simplify the following “harder” cases. 

However, determining which reductions are “easy” and which are “hard” 
is not a simple problem. We could try to use additional structural information 
about the system to determine this, but that would limit us to a specific subclass 
of models. Instead, we let the algorithm determine this dynamically on the fly. 

Our approach is summarised in Algorithm 4. Instead of reducing one tran- 
sition relation at a time, we interleave all reductions in one procedure. This 
is done by creating a number of processes, one for each $a \in \mathcal{L}$, that we run 
in an interleaving fashion. The processes work in two phases: FORWARD and 
EXTENDEDCOMPONENT. The goal of a process in the FORWARD phase is to 
compute the value of the forward set starting from the states that enable an a- 
transition, and then switch to the EXTENDEDCOMPONENT phase, in which the 
goal of the process is to compute the corresponding extendedComponent set. 
The computation proceeds using the one-reachability-step functions FWD and 
BWD, which we defined in Algorithm 1. Every process has its process variables 
that are local to each process, but their values are kept between steps: The set 
reach represents the part of forward that has already been discovered; the set 
component represents the part of extendedComponent that has already been 
discovered; the variable weight is explained below; and the variable phase holds 
the current phase of the process (FORWARD or EXTENDEDCOMPONENT). 

The process selected for execution in each iteration (line 31) is the one with 
the smallest weight. The weight of a process is determined by the size of the 
symbolic representation of the set it is currently expanding (reach in the first 


516 N. Beneš et al. 


Algorithm 4: Interleaved Transition Guided Reduction 

 1  Process ITGRWORKER(a ∈ ℒ) 
        process variables: reach, component, weight, phase 
        shared variables: ℒ, universe, processes 
 2      initialisation: 
 3          reach ← CANPOST(a, universe); 
 4          weight ← NODECOUNT(reach); 
 5          phase ← FORWARD; 
 6      step if phase is FORWARD: 
 7          (fixpoint, reach) ← FWD(reach ∩ universe, universe); 
 8          weight ← NODECOUNT(reach); 
 9          if fixpoint then 
10              if universe ≠ reach then 
11                  basin ← BWD*(reach, universe); 
12                  universe ← universe \ (basin \ reach); 
13              component ← CANPOST(a, universe); 
14              weight ← NODECOUNT(component); 
15              phase ← EXTENDEDCOMPONENT; 
16      step if phase is EXTENDEDCOMPONENT: 
17          (fixpoint, component) ← BWD(component, reach ∩ universe); 
18          weight ← NODECOUNT(component); 
19          if fixpoint then 
20              bottom ← (reach ∩ universe) \ component; 
21              if bottom ≠ ∅ then 
22                  basin ← BWD*(bottom, universe); 
23                  universe ← universe \ (basin \ bottom); 
24              if CANPOST(a, universe) = ∅ then 
25                  ℒ ← ℒ \ {a}; 
26              stop current process (remove it from processes); 

27  Function ITGR(universe ⊆ S) 
28      processes ← {ITGRWORKER(a) | a ∈ ℒ}; 
29      initialise all processes; 
30      while processes ≠ ∅ do 
31          p ← MINBYKEY(processes, weight); 
32          run one step of p; 
33      return universe; 


phase, component in the second). For BDDs (or MDDs), this is the number of 
nodes in the decision diagram (NODECOUNT). The algorithm thus prioritises 
processes that have the potential to advance quickly because they will use fast 
symbolic operations. 

Notice that the universe variable is shared by all processes and needs to be 
now taken into account in multiple places. This means that whenever one process 
discards some states from universe, all processes benefit from this change. This 
update can be performed safely because whenever Algorithm 4 discards some 
states, the discarded set is SCC-closed. 

Because both Algorithm 3 and Algorithm 4 compute the same states for 
forward and extendedComponent (modulo the states eliminated by other reduc- 
tions), Theorem 1 and Theorem 2 remain valid for Algorithm 4 as well. The only 
difference is that this approach should be much more resilient to a bad ordering 
of transition relations. 

Finally, let us note that this approach should be quite simple to parallelise 
to some extent. If w parallel workers are available, the algorithm can advance 
w processes at a time instead of picking a single process. However, we do not 
pursue this approach in this paper as the other methods we use as a reference 
are also not parallelised. 
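
The interleaving scheme can be sketched sequentially on explicit sets. This is a hedged illustration only: set cardinality (`len`) is used as a stand-in for the decision-diagram node count, the `Worker` class and helper names are hypothetical, and transitions are again a dictionary from labels to sets of pairs.

```python
def step(trans, labels, reachable, universe, backward):
    """One FWD/BWD step: apply the first label that discovers new states."""
    for a in reversed(labels):
        if backward:
            found = {s for (s, t) in trans[a] if t in reachable and s in universe}
        else:
            found = {t for (s, t) in trans[a] if s in reachable and t in universe}
        if not found <= reachable:
            return False, reachable | found
    return True, reachable


def bwd_star(trans, labels, start, universe):
    fixpoint, result = False, set(start)
    while not fixpoint:
        fixpoint, result = step(trans, labels, result, universe, True)
    return result


def can_post(trans, a, xs):
    return {s for (s, _) in trans[a] if s in xs}


class Worker:
    def __init__(self, a, trans, universe):
        self.a = a
        self.reach = can_post(trans, a, universe)
        self.component = set()
        self.phase = "forward"
        self.weight = len(self.reach)  # assumption: size replaces NODECOUNT


def itgr(trans, universe):
    universe = set(universe)  # shared by all workers, shrinks in place
    labels = list(trans)
    workers = [Worker(a, trans, universe) for a in labels]
    while workers:
        w = min(workers, key=lambda p: p.weight)  # cheapest process first
        if w.phase == "forward":
            fixpoint, w.reach = step(trans, labels, w.reach & universe, universe, False)
            w.weight = len(w.reach)
            if fixpoint:
                if universe != w.reach:
                    basin = bwd_star(trans, labels, w.reach, universe)
                    universe -= basin - w.reach
                w.component = can_post(trans, w.a, universe)
                w.weight = len(w.component)
                w.phase = "extended"
        else:
            fixpoint, w.component = step(trans, labels, w.component, w.reach & universe, True)
            w.weight = len(w.component)
            if fixpoint:
                bottom = (w.reach & universe) - w.component
                if bottom:
                    basin = bwd_star(trans, labels, bottom, universe)
                    universe -= basin - bottom
                if not can_post(trans, w.a, universe):
                    labels.remove(w.a)
                workers.remove(w)
    return universe


trans = {"a": {(0, 1)}, "b": {(1, 2), (2, 1)}, "c": {(3, 3)}}
print(itgr(trans, {0, 1, 2, 3}))
```

On this small input the result coincides with what the sequential reduction keeps; the difference lies only in the order in which the per-label reductions make progress.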


6 Evaluation 


To see how ITGR affects the performance of attractor detection for real-life 
systems, we implemented the method for asynchronous Boolean networks (BN), 
a common logical modelling framework used predominantly in systems biology. 
Using this implementation, we aim to support the following claims: 


1. ITGR performs significantly better than available state-of-the-art tools (for 
Boolean networks) on real-world models. 

2. On realistic Boolean networks, ITGR easily scales to 1000 or more variables, 
which is not possible with other methods. 

3. Interleaving plays a crucial role in making ITGR competitive. 


To evaluate the first claim, we compare our implementation with the tool 
CABEAN [56] on a set of 125 real-world Boolean networks with up to 350 vari- 
ables. We then generate a pseudo-random set of 200 networks with similar struc- 
tural characteristics to our real-world benchmarks, but with up to 1100 variables. 
We show that ITGR can successfully deal with models of this magnitude as well. 
Finally, we compare the performance of ITGR with the basic attractor detection 
algorithm as well as with “sequential” TGR on both benchmark sets, showing 
that ITGR is overall faster and is the only method able to handle the large 
benchmarks efficiently. 

The whole set of benchmarks, as well as the implementation of all the algorithms 
in Rust, is available as a paper artefact at Zenodo (see footnote 1). Additionally, the 
method is successfully employed by our tool AEON that facilitates long-term 
analysis of Boolean networks [3]. 

Before we present the actual benchmark results, let us also first briefly com- 
ment on the modelling paradigm chosen (Boolean networks) and the actual setup 
used to perform the measurements. 


1 https://doi.org/10.5281/zenodo.4709882. 
6.1 Boolean Networks 


A Boolean network, as the name suggests, consists of $n$ Boolean variables, each 
variable having an associated update function $b_i$. The state space of the network 
consists of $2^n$ Boolean variable valuation vectors, $\{0,1\}^n$. Each update function 
takes the current state of the network and produces a new value that is assigned 
to the associated variable, $b_i : \{0,1\}^n \to \{0,1\}$. We assume the update functions 
can be applied non-deterministically, resulting in an asynchronous updating 
scheme. This is not the only updating scheme used in practice (e.g. synchronous 
or generalised asynchronous can be used as well) but is generally considered to 
cover the possible behaviour of the biological system well. 

Typically, an update function of a particular variable x only depends on a 
smaller subset of the system variables. In such a case, we say that these variables 
regulate x (specifically, y regulates x if the update function of x depends on 
the value of y). The number of such regulations in a Boolean network can be 
viewed to represent the connectedness or structural complexity of the network in 
general. In short, the more regulations the network has, the more complex update 
functions it contains, possibly resulting in more complex behaviour. Variables 
and regulations together form a directed regulatory graph. 

A Boolean network with an asynchronous updating scheme fits naturally into 
our definition of a labelled transition system. The state space of the network variables 
corresponds to $S$, i.e. $S = \{0,1\}^n$. Each transition $a_i$ of $\mathcal{L}$ corresponds 
to the application of the $i$-th Boolean update function $b_i$ to the $i$-th Boolean 
variable, i.e. $s \xrightarrow{a_i} s' \Leftrightarrow s' = s[i \mapsto b_i(s)] \wedge s' \neq s$. 
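
The correspondence can be made concrete by enumerating the transitions of a tiny network. This sketch is explicit-state and therefore only feasible for very small $n$; the function name `async_transitions`, the use of variable indices as labels, and the example update functions are all assumptions for illustration.

```python
from itertools import product


def async_transitions(update_functions):
    """Build one transition relation per variable of a Boolean network.

    update_functions[i] maps a state (tuple of 0/1) to the new value of
    variable i. A transition exists only if the value actually changes.
    """
    n = len(update_functions)
    trans = {}
    for i, b in enumerate(update_functions):
        relation = set()
        for s in product((0, 1), repeat=n):
            value = int(b(s))
            if value != s[i]:
                successor = s[:i] + (value,) + s[i + 1:]
                relation.add((s, successor))
        trans[i] = relation
    return trans


# Hypothetical two-variable negative feedback loop: x0 copies x1, x1 negates x0.
network = [lambda s: s[1], lambda s: 1 - s[0]]
for label, relation in async_transitions(network).items():
    print(label, sorted(relation))
```

Every state of this example enables exactly one update, so the four states form a single cycle, which is a non-trivial attractor in the terminology used below.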

When dealing with Boolean networks, BSCCs are typically referred to as 
attractors. The rationale behind this term is that the BSCCs are the states 
where the fair runs of any system eventually converge to—the behaviour is thus 
attracted towards these states. In the following, we use these two terms inter- 
changeably. 

As a symbolic representation, the most natural choice for Boolean networks 
is Reduced Ordered Binary Decision Diagrams (ROBDD, or BDD) [10], as 
these can be easily used to represent sets of Boolean vectors. We do not make any 
specific optimisations with regard to variable ordering, but to enable saturation-like 
reachability, we assume that the ordering of transitions $a_1, \dots, a_n$ follows 
that of the variables in the ROBDD that they update. 

Since a Boolean network consists of n Boolean variables, a set of states of 
such a network can be seen as a Boolean formula (represented as a BDD) over the 
network variables. Here, a state belongs to such a set iff it represents a satisfying 
assignment of this formula — a fairly standard approach to state-space encoding 
using BDDs. To apply a particular update function, we must first construct a 
BDD describing all states where the update function should change the value of 
its associated variable (note that this BDD can be reused in subsequent steps). 
By computing an intersection of a state set with this BDD (yielding the result 
of the CANPOST operation) and then performing a “bit flip” of the updated 
variable in the result BDD, we obtain a set of successor states with respect to 
this one update function (i.e. POST). Similarly, we can obtain CANPRE and PRE. 
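
The intersect-then-flip construction can be mimicked with explicit sets of bit vectors. In this sketch a Python set of tuples stands in for a BDD, which is an assumption made purely for readability; the helper names are illustrative and do not refer to any BDD library.

```python
from itertools import product


def enabled_set(update, i, n):
    """States where applying `update` changes variable i.
    Computed once per update function and reused in every step."""
    return {s for s in product((0, 1), repeat=n) if int(update(s)) != s[i]}


def flip(state, i):
    """Return the state with bit i negated."""
    return state[:i] + (1 - state[i],) + state[i + 1:]


def can_post(enabled, xs):
    """Intersection of a state set with the enabling set."""
    return xs & enabled


def post(enabled, i, xs):
    """Successors: flip bit i in every enabled state of xs."""
    return {flip(s, i) for s in can_post(enabled, xs)}


def pre(enabled, i, xs):
    """Predecessors: states that are enabled and flip into xs."""
    return {flip(t, i) for t in xs if flip(t, i) in enabled}


# Hypothetical update function for variable 0 of a 2-variable network: x0 := x1.
enabled = enabled_set(lambda s: s[1], 0, 2)
print(post(enabled, 0, {(0, 1), (0, 0)}), pre(enabled, 0, {(1, 1)}))
```

Because the update can only negate its own variable, a single bit flip is enough to move between a state and its successor in either direction.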
6.2 Benchmark Set-Up 


Real-world Models. To provide the best possible real-world evaluation of 
our method, we have collected all the models from publicly available Boolean 
network repositories that we were aware of, and that support the universally 
accepted SBML-qual format [12] for model transfer. Specifically, our benchmark 
set includes models available through GINsim [13], Cellcollective [30], Biomod- 
els [39] and the COVID-19 disease map project [48]. Together, the benchmark 
set consists of 125 models, peaking at 351 variables and 1100 regulations, respec- 
tively. 

Note that some of these models contain Boolean constants (also called inputs 
or parameters) that can be specified by the user. For such models, we performed 
a simple parameter sampling to determine if some of the valuations result in non-trivial 
attractors, as these are the main focus of this paper. If such a valuation was 
found, we used it in our benchmark set. However, for the vast majority of 
models (approximately 90%), there were either no tunable parameters or the 
sampling did not find any significant changes in the structure of attractors. 


Environment. We ran all the benchmarks on a machine with a modest 4-core 
i7 4790 and 32GB of RAM. However, none of the benchmarks used more than 
one core at a time, and typical memory consumption was significantly below 
1GB. Hence our evaluation should be reproducible even using a much slower 
machine. 

We have measured the runtime for each model using the standard Unix time 
utility, with a one-hour timeout per benchmark model. We have run a large 
portion of the benchmarks repeatedly but have not observed any significant 
variance in runtime; we thus only report average runtime values. 


CABEAN. In the real-world performance test, we compare our method to 
the tool CABEAN [56]. To the best of our knowledge, CABEAN is both the 
most recent and the most advanced tool that targets the detection of non-trivial 
attractors in asynchronous Boolean networks. Other tools that we know of (such 
as [14,21,36]) are not built for systems of the size we are dealing with (for exam- 
ple, due to explicit state-space representation). CABEAN focuses on Boolean 
network reprogramming, but as a necessary component of this process, it also 
provides state-of-the-art methods for attractor detection. Specifically, CABEAN 
uses symbolic manipulation using BDDs, just as our method, but implements 
advanced decomposition techniques [50] to reduce the state space of the network. 


6.3 Real-World Networks 


The core of our results is summarised in Fig. 2. On the left, we see a comparison 
of total successfully completed benchmarks by both CABEAN and ITGR, and 
on the right, we have relative speedup for each individual benchmark. On the 
right, we only show benchmarks that took CABEAN more than one second to 
complete (remaining models would be normally easy to compute even without 
any special techniques). 
Fig. 2. The left plot shows the total number of benchmarks that each tool has completed 
before a certain time limit. The dashed line represents CABEAN, whereas the 
solid line shows our ITGR implementation. On the right, the graph displays the rela- 
tive runtime between CABEAN and ITGR. The dotted lines represent 10x and 100x 
speedup compared to CABEAN. The solid circles are the benchmarks where CABEAN 
successfully computed the attractors. The crosses represent the benchmarks where 
CABEAN was able to finish the decomposition but failed to extract the actual attrac- 
tors. Notice that we use logarithmic scaling for the time in both graphs. 


In this test, ITGR completed all but one benchmark in less than 1 min. The 
one remaining case took almost 15 min to complete. However, the reduction 
process for this model was also quite fast at roughly 100s. The rest of the com- 
putation was spent on identifying the 352 non-trivial attractors in the remaining 
state space (together, the attractors account for almost 2°° states — by far the 
largest we have seen in any model). Out of the 125 benchmarks, we uncovered 
non-trivial attractors in exactly 40 models (however, this also includes 6 models 
with only small 2- or 4-state attractors). 

On the other hand, CABEAN failed to compute attractors for 19 of these 
125 models (15.2%). Upon closer inspection, all but one of these 19 benchmarks 
contained non-trivial attractors, which means CABEAN failed for 45% of models 
with non-trivial attractors. 
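
The two percentages can be re-derived from the counts reported in this section. The short check below uses only those counts; the variable names are illustrative.

```python
total_models = 125
cabean_failures = 19
models_with_nontrivial_attractors = 40
# "All but one" of the failed benchmarks contained non-trivial attractors.
failures_with_nontrivial = cabean_failures - 1

failure_rate = 100 * cabean_failures / total_models
nontrivial_failure_rate = 100 * failures_with_nontrivial / models_with_nontrivial_attractors

print(f"{failure_rate:.1f}% of all models")
print(f"{nontrivial_failure_rate:.0f}% of models with non-trivial attractors")
```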

However, we note that on some models, CABEAN did not simply time out 
but actually terminated early due to a segfault. While this behaviour does not 
seem to be directly linked to the total size of the attractors, it certainly appears 
to be more common in such networks. We have seen this happen in networks 
with relatively small attractors, while other networks (even one with a $2^{30}$-state 
attractor) were completed successfully. We hypothesise that this occurs 
when the decomposition is (at least partially) successful but does not reduce the 
complexity of the network enough to continue with attractor search (see footnote 2). 

These failed attempts are visualised in the right plot using crosses, as they 
represent an interesting lower-bound approximation on the performance of this 
decomposition technique. Overall, the plot shows that for the vast majority 
of models, ITGR provides an order of magnitude (10x) speedup compared to
CABEAN, with some (especially larger) models approaching or exceeding the 100x
speedup threshold.

Overall, we have shown that ITGR is capable of solving all publicly available
problem instances (that we know of), and it outperforms current state-of-the-art
decomposition methods with a median speedup of 16x (77x on average). Naturally,
we also compared the actual attractors found by CABEAN and ITGR,
and we are happy to report that we found no inconsistencies between the two
methods.


6.4 Pseudo-random Networks 


Next, we set out to test the limits of ITGR on larger models than the ones avail- 
able in the public repositories today. Specifically, we wanted to test networks 
with 1000 or more Boolean variables. While such a number is arguably not pos- 
sible to achieve in a single hand-made model, fully or semi-automated machine 
learning techniques [5,8,22,51] are making models of this magnitude much more 
approachable. 


Pseudo-Random BNs. To create a benchmark set of larger models, we have 
decided to generate pseudo-random networks structurally similar to our real- 
world benchmark set. Biological systems, specifically protein and gene regulatory 
networks, are known to follow certain properties of small-world networks [9, 
58]. However, aside from other differences, they are directed and typically quite 
sparse (our real-world benchmark set has the average node degree of 4.3). This 
makes most common random network models unsuitable for this specific task. 
For example, the famous Watts-Strogatz model would, in this case, assume that
the average degree is significantly larger than $\ln(1000) \approx 7$.

We have thus first measured the relative in- and out-degree distributions 
in the regulatory graphs of our real-world networks and then generated ran- 
dom networks by sampling from this distribution to approximate the real-world 
dataset. Additionally, regulatory graphs of Boolean networks are essentially 
always weakly connected. In each model, we have thus filtered out all vari- 
ables except the largest weakly connected component. Note that this makes the 
dataset slightly skewed towards more connected networks (i.e. more challeng- 
ing), as these have a higher chance of being weakly connected when randomly 
generated. However, it is still well within the connectivity limits expected based 
on the real-world dataset. 
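
As a rough illustration of this generation step (not the generator used for the experiments), the sketch below samples in-degrees from an empirical list and keeps only the largest weakly connected component. The function names, the example degree list and the restriction to in-degrees only are our assumptions; the procedure described above also matches the out-degree distribution.

```python
import random
from collections import Counter

def sample_regulatory_graph(n, in_degrees, rng):
    """Random directed graph whose in-degrees are drawn from an empirical list.

    `in_degrees` is a hypothetical sample of in-degrees measured on real models.
    Returns a set of edges (regulator, target).
    """
    edges = set()
    for target in range(n):
        k = min(rng.choice(in_degrees), n)
        for regulator in rng.sample(range(n), k):
            edges.add((regulator, target))
    return edges

def largest_weak_component(n, edges):
    """Nodes of the largest weakly connected component (union-find)."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)  # edge direction is ignored
    sizes = Counter(find(v) for v in range(n))
    root, _ = sizes.most_common(1)[0]
    return {v for v in range(n) if find(v) == root}

rng = random.Random(0)
edges = sample_regulatory_graph(200, [1, 1, 2, 2, 3, 4, 6], rng)
kept = largest_weak_component(200, edges)
# Drop every variable (and regulation) outside the largest component.
edges = {(a, b) for a, b in edges if a in kept and b in kept}
```

Filtering after generation, rather than rejecting disconnected graphs, mirrors the text: it is what biases the resulting set slightly towards more connected networks.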


2 Developers of CABEAN have confirmed that this assumption is essentially accurate,
i.e. for some structurally complex networks with large non-trivial attractors, the tool
can segfault instead of timing out.
To generate the Boolean functions, we have measured that 80.7% of the regulations
in the real-world dataset are positively monotonic, with the remainder
being negatively monotonic (monotonicity is typically expected in biological
networks). Each regulation was thus assigned monotonicity based on this distribution,
and a function was generated by randomly choosing between $\land$ and
$\lor$ when connecting the positive/negative literals. Note that this does not cover
the full spectrum of possible Boolean functions, but it is well within reason for
biological networks, where some techniques tend to even implicitly assume the
function is just a simple conjunction/disjunction of literals.
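
A minimal sketch of such a function generator is given below. It is only one possible reading of the description: the left-to-right chaining of literals, the tuple-based formula representation and all names are our assumptions, not the implementation used to build the benchmark.

```python
import random

def random_update_function(regulators, rng, p_positive=0.807):
    """Build a random formula over `regulators` as a nested tuple.

    Each regulator becomes a positive literal with probability `p_positive`
    (80.7% in the text) and a negated literal otherwise; the literals are then
    chained left to right with randomly chosen "and"/"or" (an assumption).
    """
    if not regulators:
        raise ValueError("at least one regulator is required")
    literals = [
        ("var", r) if rng.random() < p_positive else ("not", ("var", r))
        for r in regulators
    ]
    formula = literals[0]
    for lit in literals[1:]:
        formula = (rng.choice(["and", "or"]), formula, lit)
    return formula

def evaluate(formula, state):
    """Evaluate a formula in a state given as a dict from variable to bool."""
    op = formula[0]
    if op == "var":
        return state[formula[1]]
    if op == "not":
        return not evaluate(formula[1], state)
    left, right = evaluate(formula[1], state), evaluate(formula[2], state)
    return (left and right) if op == "and" else (left or right)

update = random_update_function(["a", "b", "c"], random.Random(1))
print(update, evaluate(update, {"a": True, "b": False, "c": True}))
```

Because every regulator occurs exactly once and negation is applied only to literals, each generated function is monotonic in every input, with the sign chosen for that regulation.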
Fig. 3. Runtime comparison of ITGR, “sequential” TGR and the basic symbolic BSCC
detection. The left plot shows the real-world benchmark set with up to 350 variables 
per model. On the right, we see the medium synthetic benchmark ranging from 50 to 
1100 variables. Note that the large synthetic benchmark (~ 1000 variables only) is not 
shown, as ITGR was the only method capable of actually completing these models. All 
the time axes have a logarithmic scale. 


Performance. In the end, we have obtained two benchmark sets: A medium set 
with 100 networks ranging from 50 to 1100 variables, and one large benchmark 
set, also with 100 models, but all with ~ 1000 variables and ranging from 2471 to 
5099 regulations. Out of these 200 models, we discovered non-trivial attractors 
in 61 of them. 

The runtime for the medium benchmarks is summarised in Fig. 3 (right).
Here, we see that ITGR successfully completed all instances within 10 min. For 
the large benchmark set, ITGR consistently finished 98% of the models within 5 
to 10 min. The remaining two outliers took 28 and 55 min to complete. Similar 
to what we saw in the real-world benchmark, these models contained very large 
non-trivial attractors (the largest having again more than 2°° states) and were 
thus not limited by the speed of the reduction but by the diameter of the actual 
attractors. 
Additionally, for each reduction procedure, we kept track of the actual number
of reachability iterations that needed to be performed (specifically, how many
times we applied line 21 of Bwd as shown in Algorithm 1, or the same line in
Fwd). For all models, this number was well below 10000 iterations, which is
quite low considering the procedure needed to evaluate up to 1100 transition 
relations. In particular, this supports our hypothesis that ITGR works well due 
to typically short distances between states that can fire individual transitions. 


6.5 Interleaving Performance Impact 


Finally, we would like to evaluate the influence of smart interleaving on the 
performance of ITGR. For this purpose, we consider three algorithms: 


1. the basic symbolic algorithm with saturation as described in Sect. 3; 

2. TGR, as described in Sect. 4, applied to variables in the order in which they
appear in the network declaration (that is, without any interleaving); 

3. the full ITGR as described in Sect. 5. 


Keep in mind that all three approaches use the same implementation of 
symbolic representation and differ only in the actual attractor detection. Also, 
TGR/ITGR use the basic algorithm to actually identify attractors once the 
reduction is completed. Any speedup between TGR and the basic algorithm 
can be thus directly attributed to the state-space reduction, while any speedup 
between ITGR and TGR is due to the introduction of interleaving. 

For the real-world benchmarks (up to 350 variables) and medium synthetic 
benchmarks (50 to 1100 variables), the comparison is presented in Fig. 3. Here,
we see that the basic algorithm is indeed not generally sufficient for large net- 
works, finishing only 62 of the 125 real-world models and only 5 of the medium 
synthetic benchmarks (the main reason was typically poor pivot selection; how- 
ever, some instances also timed out due to long reachability procedures). 

The difference between TGR and ITGR is not as drastic for real-world models.
TGR finished 122/125 instances but was consistently slower than ITGR,
especially on larger models (on one instance, TGR needed 55 min where ITGR needed 2.9 s,
i.e. a speedup of more than 1000x). However, as we look into even larger graphs with
the medium synthetic benchmark set, ITGR easily outperforms TGR, which
completed only 26/100 instances.
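
For reference, the speedup quoted for that single instance can be recomputed from the two reported runtimes (variable names are ours):

```python
tgr_seconds = 55 * 60    # 55 min reported for sequential TGR
itgr_seconds = 2.9       # 2.9 s reported for ITGR
speedup = tgr_seconds / itgr_seconds
print(f"{speedup:.0f}x")
```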

Finally, for the large benchmarks, all with ~ 1000 variables, we have ITGR 
completing all benchmarks within the 1-h timeout (with 98% finishing within 
10 min); no other method has finished any of the 100 models within this limit. 
This leaves ITGR as the only implementation in this comparison capable of 
successfully analysing networks with 1000 or more Boolean variables. 


7 Conclusions 


In this paper, we present a novel symbolic method for BSCC detection in state-transition
graphs of labelled transition systems, called interleaved transition
guided reduction (ITGR). The method relies on the observation that BSCCs
in real-world systems rarely employ all transition labels available. Therefore, if 
a state s can fire a transition with a label that is not employed by some BSCC 
reachable from s, after applying ITGR, s is eliminated. As a result, all paths in 
the remaining state space only perform transitions with labels employed by their 
reachable BSCCs. If the system has only trivial BSCCs, this solves the problem 
completely. For non-trivial BSCCs, this may make the problem tractable using 
previously known techniques. 

ITGR relies on smart interleaving to prioritise the elimination of “symbol- 
ically easier” transitions. Completing the reduction in this order allows ITGR 
to subsequently simplify the analysis of transitions which would initially be too 
complex to handle. 

We tested the method on a large benchmark set of real-world Boolean net- 
works (up to 350 variables) as well as randomly generated benchmarks (up to 
1100 variables) with similar structural properties. Our experiments show that 
ITGR significantly outperforms the state-of-the-art tool CABEAN and can eas- 
ily handle all models from both benchmark sets, pushing the boundary of what 
was previously possible in this field. 
Abstract. Semi-algebraic abstraction is an approach to the safety verification
problem for polynomial dynamical systems where the state space
is partitioned according to the sign of a set of polynomials. Similarly 
to predicate abstraction for discrete systems, the number of abstract 
states is exponential in the number of polynomials. Hence, semi-algebraic 
abstraction is expensive to explicitly compute and then analyze (e.g., to 
prove a safety property or extract invariants). 

In this paper, we propose an implicit encoding of the semi-algebraic 
abstraction, which avoids the explicit enumeration of the abstract states: 
the safety verification problem for dynamical systems is reduced to a cor- 
responding problem for infinite-state transition systems, allowing us to 
reuse existing model-checking tools based on Satisfiability Modulo The- 
ory (SMT). The main challenge we solve is to express the semi-algebraic 
abstraction as a first-order logic formula that is linear in the number of 
predicates, instead of exponential, thus letting the model checker lazily 
explore the exponential number of abstract states with symbolic techniques.
We implemented the approach and experimentally validated its
potential to prove safety for polynomial dynamical systems.


1 Introduction 


Non-linear dynamical systems are characterized by continuous evolution result- 
ing from ordinary differential equations containing non-linear polynomials. Prov- 
ing safety properties for non-linear dynamical systems is extremely challenging, 
and several approaches have been proposed. Semi-automatic deductive verification
techniques based on theorem proving include proving hybrid programs
using differential dynamic logic [27] or hybrid Cyber Physical Systems (CPS)
using Hybrid Hoare Logic (HHL) [21]. Among various automatic techniques
(e.g., [30]), an important line of work applies symbolic model checking to abstractions
of hybrid systems, using qualitative predicate abstraction [34].
Unfortunately, the problem with the above techniques is twofold. On one side, 
the abstractions are often unable to precisely lift important information, thus 
resulting in an abstract system that is not strong enough to prove the property.
On the other side, the abstraction may be too expensive to
compute, especially in the non-linear case.

To tackle the first problem, we consider the semi-algebraic decomposition for 
dynamical systems of [32]. The idea is to build an abstraction from a given set of 
polynomials, partitioning the concrete state space according to the sign of each 
polynomial. The abstraction is exact: there is a transition from an abstract state
to another abstract state if and only if there is at least one concrete transition
between the concretizations of the two abstract states. Semi-algebraic decomposition
is also appealing because it can be made more precise by adding new polynomials.

The abstraction can be computed by means of logical operations (by repeat- 
edly checking the satisfiability of quantifier-free formulas interpreted over the 
reals). However, the second problem remains: the explicit computation of the 
abstraction is extremely costly, since it requires the enumeration of all possible
transitions between abstract states, whose number is exponential in the number of
considered polynomials.

Interestingly, an effective use of abstraction is at the core of the most suc- 
cessful verification techniques for discrete infinite-state transition systems. The 
technique of predicate abstraction [16] was originally adapted for symbolic veri- 
fication in [9] and then optimized in [19]. This idea has been further developed 
in implicit predicate abstraction [35], which eliminates the burden of an up-front 
exponential blowup in the computation of the abstract states by embedding the 
abstraction in the symbolic encoding of the transitions. This approach has also been
used in combination with IC3 [1,5,6].

In this paper, we propose a new approach to the verification of dynamical 
systems with non-linear polynomial dynamics based on the use of semi-algebraic 
decomposition. The contributions of the paper are the following: 


— We cast the problem of computing and verifying properties of dynamical sys- 
tems using the semi-algebraic decomposition in the framework of verification 
via implicit predicate abstraction (i.e., a first-order logic characterization of 
the semi-algebraic decomposition abstraction). Thus, we apply SMT-based 
model checking techniques to prove safety properties of polynomial dynami- 
cal systems. 

— We define a linear symbolic encoding for the abstraction. Note that the naive 
formulation of the predicate abstraction problem (which follows from the 
explicit computation approach proposed in [32]) is not effective in practice: 
in fact, the number of abstract states is exponential in the total number of 
polynomials that define the abstraction, and the encoding requires enumerating
all the possible pairs of abstract states to check the existence of an abstract
transition. We exploit the properties of the LZZ formulation to define a con- 
cise encoding that is linear in the number of the polynomials, hence making 
the approach feasible in practice. 

— We implement and experimentally evaluate the approach. The results show 
how the reduction to the verification of discrete infinite-state transition sys- 
tems is complementary to reachability analysis techniques and proves cases 
that were previously out of reach for the state-of-the-art tools. 
Outline: The rest of the paper is structured as follows: Sect. 2 gives an overview
of the approach with a motivating example; Sect. 3 provides the background def- 
initions; Sect. 4 shows the naive encoding of the abstraction, while in Sect. 5 we 
derive the linear encoding and define the related implicit semi-algebraic abstrac- 
tion; in Sect. 6, we present the experimental results; in Sect. 7, we discuss the 
related work and, finally, in Sect. 8, we draw some conclusions and directions for 
future work. 


2 Overview of the Approach 


Consider a verification problem (adapted from [22]) on the non-linear dynamical
system with two variables x and y, and differential equations $\dot{x} = -2y$, $\dot{y} = x^2$.
We want to prove that the system cannot reach the set of bad states
$(x+2)^2 + y^2 - 1 < 0$ (i.e., it never leaves the safe region $(x+2)^2 + y^2 - 1 > 0$) when starting
from the initial set of states $x - y - \frac{1}{2} > 0 \land x + 2 > 0$. Note that although in
this example the evolution of the system is not restricted, our approach can deal
with the more general case in which the evolution is constrained by an invariant
condition that must always hold. The system is safe and avoids the set of bad
states (see the system’s dynamics in Fig. 1a).

We can prove that the system is safe by first constructing and then model
checking a discrete semi-algebraic abstraction [32]: given the set of polynomials
$A := \{x - y - \frac{1}{2},\ x + y + \frac{1}{2},\ x + 2\}$, the semi-algebraic abstraction partitions
the state space according to the sign ($\{>, <, =\}$) of the polynomials in A (an
example of an abstract state is $x + 2 > 0 \land x - y - \frac{1}{2} < 0 \land x + y + \frac{1}{2} < 0$,
shown as one of the numbered circles in Fig. 1b). There exists a transition from an abstract state to
another one if the two states are neighbors and there exists at least one trajectory
of the dynamical system going from one state to the other. The existence of such
a trajectory can be checked using the LZZ algorithm [22], which checks if a semi-algebraic
set $\psi$ is a differential invariant for a polynomial dynamical system
f when its execution is restricted to the domain H (another semi-algebraic
set). The algorithm reduces the invariant check to the satisfiability of the Non-Linear
Real Arithmetic Theory formula $LZZ_{\psi,f,H}(Z)$, where Z is a set of real-valued
variables. We can systematically check if there exists a transition from an
abstract state $s_1$ to the abstract state $s_2$ by proving that $s_1$ is not invariant when
restricted to the domain $s_1 \lor s_2$ (i.e., checking that $LZZ_{s_1,f,s_1 \lor s_2}(Z)$ is false).
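
To make the partitioning concrete, the sketch below maps a concrete state to the sign vector naming its abstract state. It assumes the polynomials of A are x - y - 1/2, x + y + 1/2 and x + 2 (our reading of the example) and uses hypothetical helper names; it illustrates only the abstraction map, not the LZZ-based transition check.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Polynomials of A from the running example, as functions of a concrete state.
POLYNOMIALS = {
    "x - y - 1/2": lambda x, y: x - y - HALF,
    "x + y + 1/2": lambda x, y: x + y + HALF,
    "x + 2": lambda x, y: x + 2,
}

def sign(value):
    return ">" if value > 0 else "<" if value < 0 else "="

def abstract_state(x, y):
    """Map a concrete state (x, y) to the sign vector of its abstract state."""
    return tuple(sign(p(x, y)) for p in POLYNOMIALS.values())

def in_initial_set(x, y):
    return x - y - HALF > 0 and x + 2 > 0

def in_bad_set(x, y):
    return (x + 2) ** 2 + y ** 2 - 1 < 0

# The point (-1, 0) lies in the abstract state given as an example above:
# x - y - 1/2 < 0, x + y + 1/2 < 0 and x + 2 > 0.
print(abstract_state(Fraction(-1), Fraction(0)))
```

Exact rational arithmetic is used so that states lying exactly on a zero set are classified as "=" rather than being perturbed by rounding.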

Furthermore, we can use an algorithm, called LazyReach [32], to compute the 
forward set of reachable abstract states starting from the initial states. As usual, 
if no abstract states intersect the set of bad states then the system is safe, and the 
reachable set of abstract states is a continuous invariant for the system. Figure 1b
shows the state space of the dynamical system: the initial and bad states of the
verification problem (represented with the green and red regions, respectively),
the zero sets of the polynomials from A (represented as blue lines), and, superimposed,
the set of reachable abstract states and transitions (represented
as numbered circles and arrows between the circles). The abstraction shown in
Fig. 1b is the result of applying LazyReach to the verification problem.
Fig. 1. Safety verification problem and reachable states of the abstraction for the non-linear
dynamical system $\dot{x} = -2y$, $\dot{y} = x^2$, bad states $(x + 2)^2 + y^2 - 1 < 0$ (red
circle), and initial set of states $x - y - \frac{1}{2} > 0 \land x + 2 > 0$ (green region). Figure (a)
shows the verification problem and the system’s vector field. Figure (b) shows the
reachable abstract states and the transitions of the algebraic abstraction (numbered
circles and arrows) computed using LazyReach and the differential invariant (green and
gray regions) obtained from the set of polynomials $A = \{x - y - \frac{1}{2},\ x + y + \frac{1}{2},\ x + 2\}$
(blue lines), computed using Implicit Abstraction. Abstract states represent different
combinations of signs for the abstraction’s polynomials. Examples of abstract states
are $x + 2 > 0 \land x - y - \frac{1}{2} < 0 \land x + y + \frac{1}{2} < 0$, $x + 2 > 0 \land x - y - \frac{1}{2} = 0 \land x + y + \frac{1}{2} < 0$,
and $x + 2 > 0 \land x - y - \frac{1}{2} = 0 \land x + y + \frac{1}{2} = 0$. (Color figure online)


A main challenge for the LazyReach algorithm is to explicitly enumerate the 
reachable states and transitions among them, since their number is exponential 
in the number of polynomials in A (i.e., the total number of states is already $3^{|A|}$).
For the example above, where we have 3 polynomials, the maximum number of 
states would be 27, with an even bigger number of transitions (e.g., one must 
consider the transition between each pair of neighbouring abstract states). Even 
if LazyReach enumerates the reachable abstract states on-the-fly, the explosion 
in the number of states and transitions is still a bottleneck. Our implementation 
of LazyReach applied to the above example explores a total of 9 states and checks 
the existence of 27 transitions, taking about 12 s to complete.
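
The size of the explicit state space can be illustrated by enumerating all sign vectors. In the sketch below, the adjacency test is our assumption based on the informal description (no polynomial may jump directly between negative and positive); it counts candidate pairs only and performs no LZZ query.

```python
from itertools import product

SIGNS = ("<", "=", ">")

def all_abstract_states(num_polynomials):
    """All sign vectors over the polynomials: 3 ** num_polynomials of them."""
    return list(product(SIGNS, repeat=num_polynomials))

def are_neighbours(s1, s2):
    """Assumed adjacency: no polynomial jumps between '<' and '>' directly."""
    return s1 != s2 and all({a, b} != {"<", ">"} for a, b in zip(s1, s2))

states = all_abstract_states(3)
pairs = [(s, t) for s in states for t in states if are_neighbours(s, t)]
print(len(states), len(pairs))  # prints "27 316"
```

Even for three polynomials, the number of ordered candidate pairs is an order of magnitude larger than the number of states, and each pair would need its own invariant check in an explicit construction.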

A possible solution to tackle the state explosion problem is the DWCL algorithm,
proposed in [32]. The DWCL algorithm¹ tries to reduce the number of
abstract states by checking if the sign of a polynomial $a \in A$ is invariant, that
is if:


1 We provide the main intuition behind the DWCL algorithm and we refer the reader 
to [32] for a detailed exposition. 
- the sign of the polynomial a does not change in the initial states (i.e., the
predicate $a \bowtie 0$, with $\bowtie\ \in \{<, >, =\}$, holds for all the initial states); and

- $a \bowtie 0$ is a continuous invariant for the dynamical system (this can be checked
with $LZZ_{a \bowtie 0,f,H}(Z)$).


When a predicate $a \bowtie 0$ is a continuous invariant, the algorithm strengthens the
invariant of the dynamical system (by adding $a \bowtie 0$ to the invariants), which allows
a to be removed from the set of polynomials A. While the DWCL algorithm may already
find a strong-enough invariant to prove the safety property, the algorithm falls 
back to the LazyReach algorithm in the general case to explore the abstract 
state space, hopefully with a strengthened invariant domain and a smaller set 
of polynomials. In practice, the state-space explosion problem of LazyReach still
exists when “not enough” polynomials are sign-invariant, as happens in
our motivating example. In the example, no polynomials are sign-invariant²: this
means that the DWCL algorithm will not remove any polynomials from the set
A and LazyReach will still suffer from the state-space explosion problem.

The semi-algebraic abstraction is a specific instance of predicate abstrac- 
tion [16] of the dynamical system f. For discrete-state systems, there exist effi- 
cient algorithms to either explicitly compute the abstraction using Satisfiability 
Modulo Theory (SMT) solvers [19,20] or to implicitly represent the abstraction 
and directly verify a safety property (e.g., implicit predicate abstraction [35]). 
Since these algorithms work on a fully symbolic representation of the abstract 
state space, they can cope with the state-space explosion due to the number of 
predicates of the abstraction. However, applying the same symbolic-state techniques
to compute or verify the semi-algebraic abstraction is still challenging,
mainly because it requires expressing the transition relation T(X, X’) of the
semi-algebraic abstraction as a first-order logic formula. We notice that such a
transition relation T can be directly obtained from the abstraction’s definition³:


$$\exists Z.\ \Big( \bigvee_{(s_1, s_2) \in S^A \times S^A} s_1(X) \land s_2(X') \land \neg LZZ_{s_1, f, s_1 \lor s_2}(Z) \Big).$$


The above transition relation enumerates all the possible pairs of abstract states 
and its size is exponential in the number of polynomials in A. The additional 
variables Z are copies of the state variables of the system and are used to encode 
the LZZ condition. Clearly, even creating such a formula is not scalable and hinders
the application of the standard abstraction and verification techniques used for
discrete systems. Note that, while the LZZ algorithm works for semi-algebraic
sets (i.e., the candidate invariant $\psi$ and the invariant states H are both arbitrary
Boolean combinations of non-linear arithmetic terms), here we apply LZZ to


2 The differential-cut (DC) and the differential divide-and-conquer (DDC) proof rules
used in DWCL fail for all the polynomials from A, so DWCL would not remove any
polynomial.

3 For clarity, here we do not include additional constraints in the transition relation, 
such as the neighborhood relation, which instead we consider later in Sect. 4. 
check the existence of a transition between two abstract states, hence we still
have to explicitly enumerate the abstract transitions. 

Our main contribution, presented in Sect. 5, is a compact formulation of the
above transition relation that has a size linear in the number of polynomials in
A. The steps to obtain such an exponentially smaller transition relation are:


1. We specialize the LZZ formula $\neg LZZ_{s_1, f, s_1 \lor s_2}(Z)$ to encode the existence of
a transition between two abstract states $s_1$ and $s_2$. The resulting formula is a
disjunction, and each disjunct encodes the necessary and sufficient condition
for a continuous transition to $s_2$ to exist, either inside the set $s_1(Z)$ or outside
the set $\neg s_1(Z)$. Intuitively, we obtain a specific encoding for checking the
existence of an abstract transition, instead of reusing the LZZ as a “black
box”.

2. We “lift” the above disjunction to the disjunction of all the abstract states,
obtaining the formula:


$$\exists Z.\ \big(InsExpl_f(X, X', Z) \lor OutExpl_f(X, X', Z)\big),$$


where $InsExpl_f(X, X', Z)$ encodes the “inside condition” for all the pairs of
transitions (and similarly for the “outside condition” $OutExpl_f(X, X', Z)$).

3. The formula $InsExpl_f(X, X', Z)$ still contains an explicit enumeration on
the pairs of abstract states. We show how we obtain an equivalent formula,
$InsSymb_f(X, X', Z)$, that encodes the same condition for each polynomial
$a \in A$ in the abstraction, obtaining a linear, instead of exponential, encoding.
We apply the same reasoning on $OutExpl_f(X, X', Z)$.


We then use the concise encoding of the transition relation T to obtain a symbolic transition
system $S_{impl,P}$ that implicitly encodes the semi-algebraic abstraction for
the dynamical system f with the polynomials A and predicates
$P = \{a \bowtie 0 \mid a \in A \land \bowtie\ \in \{>, <, =\}\}$. Technically, instead of computing the predicate
abstraction, we encode an implicit abstraction [35]. Consequently, we avoid the
expensive quantifier elimination step. We can then verify the safety property on
the transition system $S_{impl,P}$ using an SMT-based model checking algorithm. We
use the algorithm from [4], since $S_{impl,P}$ contains non-linear arithmetic formulas.
Our approach verifies the example of Fig. 1 and finds the continuous invariant:


1 1 1 
JA (x-y 2 5 Vzr+y>-5), 


( < V > 2) A ( = > -Vr > 
x x x + 
y = ¥=5 Y zZ 2 2 2 


2 


which is shown as the union of the green and gray regions in Fig. 1b.


3 Preliminaries 


In this work, we consider first order logic formulas in the theory of non-linear
arithmetic over the reals (NRA). We denote with $\varphi(X)$ the formula $\varphi$ containing
free variables from the set $X = \{x_1, \ldots, x_n\}$. We simplify the notation of the
formula $\varphi(X)$ to $\varphi$ when the set X is clear from the context.
Invariant Verification for Polynomial Dynamical Systems


Safety Verification of Dynamical Systems. Given a set of variables X we write
$X = [x_1, \ldots, x_n]^T$ to specify a vector containing all the variables in X ordered
lexicographically. We use the subscript $X_i$ to access the i-th element of the
vector. We focus on polynomial dynamical systems of ordinary differential equations
(ODEs) $\dot{X} = f(X)$, where $\dot{X}$ is the vector of first-order derivatives of the
variables X and f(X) is a vector of polynomials (i.e., $f_i(X)$ is a polynomial).
The safety verification problem consists of proving that every trajectory of the
dynamical system $\dot{X} = f(X)$ starting inside the initial set of states $\psi$ and while
being inside the evolution domain constraints H remains inside the safe set of
states $\varphi$. We write the problem using the differential dynamic logic [27] formula:


$$\psi \rightarrow [\dot{X} = f(X) \ \&\ H]\,\varphi, \qquad (1)$$


asserting that if the system is in a state satisfying the pre-condition $\psi$ (the initial
states) this implies ($\rightarrow$ operator) that all the trajectories evolving according to
the ODE $\dot{X} = f(X)$ and evolution domain H (box modality $[\,]$) will satisfy the
post-condition $\varphi$ (the safe states). Formally, the system is safe if:


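$$\forall x_0 \in \psi.\ \forall \tau > 0.\ \forall t \in [0, \tau].\ \big((\gamma(x_0, t) \in H) \rightarrow \gamma(x_0, t) \in \varphi\big),$$


where the differentiable function $\gamma : \mathbb{R}^{n+1} \rightarrow \mathbb{R}^n$, such that
$\frac{d}{dt}(\gamma(x_0, t)) = f(\gamma(x_0, t))$, is the solution to the initial value problem for $x_0 \in \mathbb{R}^n$ (i.e., $\gamma(x_0, t)$
describes the state the dynamical system f reaches after $t \in \mathbb{R}$ time when starting
in the initial state $x_0$).

The problem of proving the system is safe can be reduced to finding a formula
$\theta(X)$ such that: i) $H \land \psi \rightarrow \theta$, ii) $\theta \rightarrow [\dot{X} = f(X) \ \&\ H]\,\theta$, and iii) $\theta \rightarrow \varphi$.
$\theta(X)$ is a continuous invariant [28] that contains the initial states and that is
contained in the safe states.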


LZZ Algorithm [22]. The LZZ algorithm reduces the problem of checking if $\theta$ is
a continuous invariant to checking the validity of the following formula:


$$\begin{aligned}
LZZ_{\theta,f,H}(X) := {} & \big((\theta(X) \land H(X) \land In_{f,H}(X)) \rightarrow In_{f,\theta}(X)\big) \ \land \\
& \big((\neg\theta(X) \land H(X) \land In_{-f,H}(X)) \rightarrow \neg In_{-f,\theta}(X)\big),
\end{aligned} \qquad (2)$$


where the formula $In_{f,\psi}(X)$ for the ODEs f and the formula $\psi$ represents the set
of states which will evolve inside the set $\psi$ for some non-zero time in the future.
Respectively, the formula $In_{-f,\psi}(X)$ represents the set of states that evolved inside
the set $\psi$ for some non-zero time in the past, and $-f$ represents the dynamical
system evolving in “reverse”. Note that the construction of the formula $In_{f,\psi}(X)$
assumes $\psi$ to be in disjunctive normal form (DNF):


$$\psi := \bigvee_{d \in disj(\psi)} \ \bigwedge_{a \bowtie 0 \,\in\, pred(d)} a(X) \bowtie 0,$$

where $disj(\psi)$ are the disjuncts of a formula $\psi$, $pred(d)$ are the predicates in the
disjunct d, and $\bowtie\ \in \{>, \geq\}$⁴. The formula $In_{f,\psi}(X)$ is defined as:


$$In_{f,\psi}(X) := \bigvee_{d \in disj(\psi)} \ \bigwedge_{a \bowtie 0 \,\in\, pred(d)} In_{f, a \bowtie 0}(X). \qquad (3)$$


The formula $In_{f, a \bowtie 0}(X)$ encodes the set for a single predicate $a \bowtie 0$ using the Lie
derivatives of the polynomial a(X). The i-th Lie derivative $L^{(i)}_f a$ of a polynomial
a(X) with respect to the ODEs f is defined recursively as:


$$L^{(0)}_f a := a, \qquad L^{(i)}_f a := \frac{\partial L^{(i-1)}_f a}{\partial X} \cdot f.$$

$In_{f, a > 0}(X)$ encodes that the first non-zero Lie derivative of a must be positive
in order for the trajectories of the system to enter the set $a > 0$ and stay inside
the set for a positive time⁵ (see [22] and [12] for a thorough explanation):


$$In_{f, a > 0}(X) := \bigvee_{0 \leq i \leq N_{a,f}} \Big( \bigwedge_{0 \leq j < i} L^{(j)}_f a = 0 \ \land\ L^{(i)}_f a > 0 \Big), \qquad (4)$$


$$In_{f, a \geq 0}(X) := In_{f, a > 0}(X) \ \lor \bigwedge_{0 \leq i \leq N_{a,f}} L^{(i)}_f a = 0, \qquad (5)$$


where $N_{a,f}$ is an integer constant and is an upper bound on the minimum integer
number r (called rank) such that $L^{(r)}_f a \neq 0$ (for all $x \in \mathbb{R}^n$). $N_{a,f}$ can be
computed using Gröbner bases as explained in [22].
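
The following sketch illustrates Lie derivatives and a pointwise reading of Formula (4) on the two-variable example of Sect. 2. It is an illustration only, assuming that the example reads x' = -2y, y' = x^2 with the polynomial x - y - 1/2. The dictionary-based polynomial representation and all function names are ours, and the bound passed to `enters_positive` is simply assumed to be large enough, rather than computed via Gröbner bases.

```python
from fractions import Fraction

# A polynomial in (x, y) is a dict {(i, j): coeff} meaning sum coeff * x**i * y**j.

def add(p, q):
    out = dict(p)
    for m, c in q.items():
        out[m] = out.get(m, 0) + c
    return {m: c for m, c in out.items() if c != 0}

def mul(p, q):
    out = {}
    for (i1, j1), c1 in p.items():
        for (i2, j2), c2 in q.items():
            m = (i1 + i2, j1 + j2)
            out[m] = out.get(m, 0) + c1 * c2
    return {m: c for m, c in out.items() if c != 0}

def diff(p, var):
    """Partial derivative with respect to x (var=0) or y (var=1)."""
    out = {}
    for m, c in p.items():
        if m[var] > 0:
            lowered = (m[0] - 1, m[1]) if var == 0 else (m[0], m[1] - 1)
            out[lowered] = out.get(lowered, 0) + c * m[var]
    return out

def lie_derivative(a, f):
    """One Lie derivative of a along the vector field f = (f_x, f_y)."""
    return add(mul(diff(a, 0), f[0]), mul(diff(a, 1), f[1]))

def evaluate(p, x, y):
    return sum(c * x**i * y**j for (i, j), c in p.items())

def enters_positive(a, f, point, bound):
    """Pointwise reading of Formula (4): the first non-zero Lie derivative
    of a at `point` (up to order `bound`) is positive."""
    current = a
    for _ in range(bound + 1):
        value = evaluate(current, *point)
        if value != 0:
            return value > 0
        current = lie_derivative(current, f)
    return False

F = ({(0, 1): -2}, {(2, 0): 1})                            # x' = -2y, y' = x^2
A_POLY = {(1, 0): 1, (0, 1): -1, (0, 0): Fraction(-1, 2)}  # x - y - 1/2
print(lie_derivative(A_POLY, F))   # the polynomial -2y - x^2
```

At the boundary point (0, -1/2) the polynomial is zero and its first Lie derivative evaluates to 1, so the trajectory enters the region where the polynomial is positive; at (1/2, 0) the first Lie derivative is -1/4, so it does not.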

In the following, we will only use the fact that the formula $In_{f,\psi}(X)$ for the
DNF formula $\psi$ is the DNF formula where $In_f$ is applied to the predicates (as
shown in Formula (3)).


Semi-Algebraic Abstraction [32]. The semi-algebraic abstraction of the dynamical
system $\dot{X} = f(X)$ partitions its state space with respect to a set of polynomials
$A = \{a_1, \ldots, a_m\}$. The abstraction is the (explicit state) transition system
$S_A = \langle S^A, I_{\psi,A}, T_{f,A} \rangle$ where:


- $S^A = \{ s = \bigwedge_{a \in A} a \bowtie 0 \mid \bowtie\ \in \{>, <, =\} \}$ is the set of abstract states;

- $I_{\psi,A} = \{ s \in S^A \mid s \land \psi \text{ is satisfiable} \}$ is the set of abstract initial states; and

- $T_{f,A} \subseteq S^A \times S^A$ is the abstract transition relation. A transition $(s_1, s_2) \in T_{f,A}$
if:

• $s_1$ is an abstract state adjacent to $s_2$. The abstraction exploits the continuity
assumption on f and does not allow the system to transition
directly from a state where a predicate is greater than 0 (e.g., $a > 0$) to a
state where the same predicate is less than 0 (e.g., $a < 0$), and vice-versa.


4 Later we also consider equalities (i.e., predicates of the form $a = 0$). The construction
of $In_{f, a = 0}(X)$ can be found in [12].

5 In our implementation we encode $In_{f, a > 0}(X)$ using the remainders of the Lie derivative,
as in [12].
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The abstraction does not visit two abstract states containing predicates 
with opposite signs, forcing instead to visit the intermediate state where 
the predicate is equal to 0. 

e There exists a continuous trajectory from sı to s2. This condition corre- 
sponds to checking that the following differential dynamic logic formula 
is not valid (i.e. sı is not a differential invariant when restricting the 
evolution domain to s1 V s2): 


Ss, — [x = F(X) & S1 V $2]S1, 


which can be checked using the sound and complete LZZ algorithm, i.e. 
checking the satisfiability of the first-order formula ~LZZ,, fs,vs.(Z)). 


Since the number of states $S^A$ is finite, we can compute the set of reachable states.
The concretization of this set, $\theta$, contains the initial states and is a differential
invariant. If $\theta$ further implies the safe states $\phi$, then we prove the safety verification
problem (1). However, the computation of the abstract transition relation
is exponential in the number of polynomials in $A$ because we would need to
enumerate all the possible pairs of abstract states $(s_1, s_2) \in S^A \times S^A$.
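
To make this blow-up concrete, the following minimal sketch enumerates the abstract states for a set of polynomials (represented only by hypothetical names such as `a1`, `a2`) and counts the pairs that an explicit construction would have to examine. The helper names are ours and are not part of any existing tool; the adjacency test only implements the sign-continuity condition described above.

```python
from itertools import product

SIGNS = (">", "<", "=")


def abstract_states(polys):
    """Enumerate all sign assignments (abstract states) over the polynomial names."""
    return [dict(zip(polys, signs)) for signs in product(SIGNS, repeat=len(polys))]


def adjacent(s1, s2):
    """True if no polynomial flips directly between '>' and '<' (continuity)."""
    return all({s1[a], s2[a]} != {">", "<"} for a in s1)


def candidate_pairs(polys):
    """All ordered pairs (s1, s2) of abstract states that pass the adjacency check."""
    states = abstract_states(polys)
    return [(s1, s2) for s1 in states for s2 in states if adjacent(s1, s2)]


if __name__ == "__main__":
    for m in range(1, 5):
        polys = [f"a{i}" for i in range(1, m + 1)]
        n_states = len(abstract_states(polys))
        # Columns: polynomials, states (3^m), all pairs (9^m), adjacent pairs.
        print(m, n_states, n_states ** 2, len(candidate_pairs(polys)))
```

Each adjacent pair would still require one LZZ satisfiability check, which is why the explicit construction does not scale with the number of polynomials.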


Predicate Abstraction 


A symbolic transition system $S$ is a tuple $S = \langle V, I, T \rangle$, where $V$ is a set of (state)
variables, $I(V)$ is a formula representing the initial states, and $T(V, V')$ is a
formula representing the transition relation. A state $s$ of $S$ is an interpretation
of the state variables $V$. A (finite) path $\pi$ of $S$ is a finite sequence $\pi = s_0, s_1, \ldots, s_k$
of states with the same domain and interpretation of symbols in the signature $\Sigma$
such that $s_0 \models I$ and, for all $i$, $0 \le i < k$, $s_i, s'_{i+1} \models T$. We say that a state $s$ is
reachable in $S$ iff there exists a path of $S$ ending in $s$. Given a formula $P(V)$ and
a transition system $S$, the invariant verification problem, denoted with $S \models P$,
checks if for all the finite paths $s_0, s_1, \ldots, s_k$ of $S$, for all $i$, $0 \le i \le k$, $s_i \models P$.

Predicate Abstraction [16] partitions the concrete system $S = \langle V, I, T \rangle$
according to a finite set of predicates $P = \{p_1, \ldots, p_n\}$ into a finite symbolic transition
system:

$$S_P = \langle V_P, I_P(V_P), T_P(V_P, V_P') \rangle$$

using a new abstract Boolean variable $v_p$ for each predicate $p$ ($V_P =
\{v_p \mid p \in P\}$ is the set of those new variables). The abstraction relation
$H_P(V, V_P) = \bigwedge_{p \in P} v_p \leftrightarrow p(V)$ defines how a set of concrete states is abstracted
to the abstract states. We compute the abstraction of a formula $\phi(V)$ by existentially
quantifying the concrete variables $V$:

$$\phi_P(V_P) = \exists V.(\phi(V) \land H_P(V, V_P)).$$

Similarly, we compute the abstract transition relation for $T(V, V')$:

$$T_P(V_P, V_P') = \exists V, V'.(T(V, V') \land H_P(V, V_P) \land H_P(V', V_P')).$$

The above formulation is sufficient to compute the predicate abstraction for an
infinite-state transition system $S = \langle V, I, T \rangle$ and a set of predicates $P$. However,
the main challenge in computing the abstraction is to eliminate the quantifiers,
since quantifier elimination is expensive to compute.
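
The definitions above can be illustrated on a toy finite-state system, where existential quantification reduces to brute-force enumeration. This is only a sketch under our own assumptions: the state space, the transition `trans`, and the two predicates are hypothetical, and in the infinite-state setting of this paper the enumeration would have to be replaced by quantifier elimination.

```python
from itertools import product

STATES = range(8)  # toy concrete state space (assumption)
PREDS = (lambda x: x >= 4, lambda x: x % 2 == 0)  # P = {p1, p2} (assumption)


def init(x):
    """I(V): the single initial state."""
    return x == 0


def trans(x, y):
    """T(V, V'): add 2 modulo 8."""
    return y == (x + 2) % 8


def alpha(x):
    """Abstract state of x: the truth value of every predicate (relation H_P)."""
    return tuple(p(x) for p in PREDS)


def abstract_init():
    """I_P = exists V. I(V) and H_P(V, V_P), computed by enumeration."""
    return {alpha(x) for x in STATES if init(x)}


def abstract_trans():
    """T_P = exists V, V'. T(V, V') and H_P(V, V_P) and H_P(V', V_P')."""
    return {(alpha(x), alpha(y)) for x, y in product(STATES, repeat=2) if trans(x, y)}


if __name__ == "__main__":
    print(sorted(abstract_init()))
    print(sorted(abstract_trans()))
```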


Implicit Predicate Abstraction. Implicit Predicate Abstraction [35] is a model
checking algorithm that avoids computing the abstract version of the initial
states, safety property, and transition relation; instead, it encodes the existence
of a path in the abstract system. It exploits the fact that the abstraction induces
an equivalence relation among concrete states of the system (i.e., two concrete
states are equivalent if they belong to the same abstract state) and that this
relation can be expressed as a quantifier-free formula:


$$\mathit{EQ}_P(V, \overline{V}) = \bigwedge_{p \in P} p(V) \leftrightarrow p(\overline{V}). \tag{6}$$


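We use the equivalence $\mathit{EQ}_P(V, \overline{V})$ to relate two sets of concrete states and
we encode the problem of reaching a set of target states $\neg P$ in $k$ steps of the
transition system $S$ as follows:

$$\mathit{BMC}^k_P = I(V^0) \land \mathit{EQ}_P(V^0, \overline{V}^0) \land \bigwedge_{1 \le h < k} \big( T(\overline{V}^{h-1}, V^h) \land \mathit{EQ}_P(V^h, \overline{V}^h) \big) \land T(\overline{V}^{k-1}, V^k) \land \mathit{EQ}_P(V^k, \overline{V}^k) \land \neg P(\overline{V}^k).$$

The formula $\mathit{BMC}^k_P$ is satisfiable iff there exists a path in the abstract transition
system $S_P$ of length $k$ starting from the (abstracted) initial states $I_P(V_P)$ and
reaching the (abstracted) bad states $\neg P_P(V_P)$.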
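
The encoding can be sketched on a toy finite-state system by replacing the SMT check with brute-force enumeration. Everything in this example is an assumption made for illustration (the state space, the transition, the single predicate, and the property); the function `bmc_implicit` evaluates the satisfiability of the formula layer by layer, alternating one concrete transition with one jump inside an equivalence class of $\mathit{EQ}_P$.

```python
STATES = range(8)  # toy concrete state space (assumption)
PREDS = (lambda x: x >= 4,)  # a single abstraction predicate (assumption)


def init(x):
    return x == 0


def trans(x, y):
    return y == (x + 2) % 8


def prop(x):
    """Property P: state 5 is the only bad state."""
    return x != 5


def eq_p(x, xbar):
    """EQ_P(V, Vbar): both states satisfy exactly the same predicates."""
    return all(p(x) == p(xbar) for p in PREDS)


def bmc_implicit(k):
    """Brute-force satisfiability check of BMC^k_P on the toy system."""
    # Possible values of Vbar^0: states equivalent to some initial state.
    layer = {xb for xb in STATES for x in STATES if init(x) and eq_p(x, xb)}
    for _ in range(k):
        # Concrete transition from Vbar^{h-1} to V^h, then EQ_P(V^h, Vbar^h).
        layer = {yb for xb in layer for y in STATES for yb in STATES
                 if trans(xb, y) and eq_p(y, yb)}
    return any(not prop(xb) for xb in layer)


def concrete_reachable():
    """Plain reachability on the concrete system, for comparison."""
    seen, frontier = set(), {x for x in STATES if init(x)}
    while frontier:
        seen |= frontier
        frontier = {y for x in frontier for y in STATES if trans(x, y)} - seen
    return seen


if __name__ == "__main__":
    print(bmc_implicit(0), bmc_implicit(1), sorted(concrete_reachable()))
```

In this toy instance the abstract bad state is reachable in one step although the concrete system never reaches state 5, so the abstract counterexample is spurious: the abstraction is too coarse to prove the property.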


4 Explicit Computation of the Semi-Algebraic 
Abstraction 


We frame the problem of computing the semi-algebraic abstraction as a predicate 
abstraction problem. This formulation allows us to use the standard techniques 
to compute or analyze the predicate abstraction for discrete systems. 

We consider the invariant verification problem $\psi \rightarrow [\dot{X} = f(X) \;\&\; H]\phi$ as
in Eq. (1) and a set of polynomials $A = \{a_1, \ldots, a_m\}$ for the abstraction. We
construct a symbolic transition system of the semi-algebraic abstraction:

$$S_P = \langle V_P, I_P(V_P), T_P(V_P, V_P') \rangle,$$

where the set of predicates of the abstraction is $P = \{a \bowtie 0 \mid a \in A \land \bowtie \in \{>, <, =\}\}$,
and the set of abstract variables $V_P$ is defined as in Sect. 3 (i.e., the abstraction
contains a Boolean variable $v_p$ for each predicate $p \in P$). We similarly use
the formula $H_P(X, V_P)$ to describe the abstraction relation over the concrete states.
The formulas $I_P(V_P)$ and $\neg P_P(V_P)$ are the semi-algebraic abstractions of the initial
states $\psi$ and of the unsafe states $\neg\phi$:

$$I_P(V_P) = \exists X.(\psi(X) \land H_P(X, V_P)), \qquad \neg P_P(V_P) = \exists X.(\neg\phi(X) \land H_P(X, V_P)),$$

and we obtain the abstraction by existentially quantifying the concrete variables
$X$. The definition of the abstract transition relation $T_P(V_P, V_P')$, which differs
from the encoding of the semi-algebraic decomposition, is:

$$T_P(V_P, V_P') = \exists X, X'. \big( N(X, X') \land H(X) \land H(X') \land H_P(X, V_P) \land H_P(X', V_P') \land \exists Z.\, T_A(X, X', Z) \big), \tag{7}$$

where $N(X, X')$ encodes the adjacency relation between abstract states:

$$N(X, X') = \bigwedge_{a \in A} \big( (a(X) < 0 \rightarrow a(X') \le 0) \land (a(X) > 0 \rightarrow a(X') \ge 0) \big),$$

and $T_A(X, X', Z)$ encodes the existence of a transition in the dynamical system
$f$ for each pair of abstract states $(s_1, s_2) \in S^A \times S^A$:

$$T_A(X, X', Z) = \bigvee_{(s_1, s_2) \in S^A \times S^A} \big( s_1(X) \land s_2(X') \land \neg\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z) \big). \tag{8}$$


Theorem 1. The transition systems $S_A$ and $S_P$ are bisimilar.


Corollary 1. $S_P \models \neg\neg P_P(V_P)$ implies $\psi \rightarrow [\dot{X} = f(X) \;\&\; H]\phi$.


Proof (sketch). The proof follows directly from Theorem 1.


While the encoding of the transition relation $T_P(V_P, V_P')$ is symbolic, it (and
in particular the sub-formula $T_A(X, X', Z)$) explicitly enumerates an exponential
number of pairs of abstract states. Clearly, this encoding is not practical and
defeats the purpose of using symbolic techniques to compute the abstraction.


5 Linear Encoding of the Semi-Algebraic Abstraction 


Specializing the LZZ Formula for Checking Abstract Transitions 


The construction of the semi-algebraic abstraction uses the formula
$\neg\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z)$ to encode the existence of a transition from the abstract state
$s_1$ to the abstract state $s_2$. We observe that here the LZZ algorithm is applied
to formulas with a specific structure (the abstract states $s_1(Z)$ and $s_2(Z)$), in
contrast to arbitrary semi-algebraic sets as in the general case of $\mathit{LZZ}_{\theta,f,H}(X)$,
where the formulas $\theta$ and $H$ are in DNF. Instead, in the case of $\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z)$,
each abstract state $s_i(X)$ assigns a sign to each polynomial $a \in A$ and is represented
as a conjunction of predicates $s_i = a_1 \bowtie_1 0 \land a_2 \bowtie_2 0 \land \ldots \land a_m \bowtie_m 0$,
where $\bowtie_j \in \{>, <, =\}$. We will write the conjunction representing a state $s_i(X)$
as $\bigwedge_{a \bowtie 0 \in s_i} a(X) \bowtie 0$. Note also that the evolution domain constraint is
the disjunction of two abstract states, $s_1 \lor s_2$.

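We specialize Eq. (2) to the specific case of $\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z)$. We will use
this specialization to obtain a compact (linear in the number of polynomials)
encoding later in the section. Instantiating Formula (2) to the case of
$\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z)$, we get:

$$\begin{aligned}
\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z) ={}& \big((s_1(Z) \land (s_1(Z) \lor s_2(Z)) \land \mathit{In}_{f,s_1 \lor s_2}(Z)) \rightarrow \mathit{In}_{f,s_1}(Z)\big) \;\land \\
& \big((\neg s_1(Z) \land (s_1(Z) \lor s_2(Z)) \land \mathit{In}_{-f,s_1 \lor s_2}(Z)) \rightarrow \neg\mathit{In}_{-f,s_1}(Z)\big)
\end{aligned} \tag{9}$$

Applying the Boolean identities $(\alpha \land (\alpha \lor \beta)) \Leftrightarrow \alpha$ and $(\neg\alpha \land (\alpha \lor \beta)) \Leftrightarrow \neg\alpha \land \beta$:

$$\begin{aligned}
\Leftrightarrow{}& \big((s_1(Z) \land \mathit{In}_{f,s_1 \lor s_2}(Z)) \rightarrow \mathit{In}_{f,s_1}(Z)\big) \;\land \\
& \big((\neg s_1(Z) \land s_2(Z) \land \mathit{In}_{-f,s_1 \lor s_2}(Z)) \rightarrow \neg\mathit{In}_{-f,s_1}(Z)\big)
\end{aligned} \tag{10}$$

Rewriting the implication and applying De Morgan's laws:

$$\begin{aligned}
\Leftrightarrow{}& \big(\neg s_1(Z) \lor \neg\mathit{In}_{f,s_1 \lor s_2}(Z) \lor \mathit{In}_{f,s_1}(Z)\big) \;\land \\
& \big(s_1(Z) \lor \neg s_2(Z) \lor \neg\mathit{In}_{-f,s_1 \lor s_2}(Z) \lor \neg\mathit{In}_{-f,s_1}(Z)\big)
\end{aligned} \tag{11}$$

Expanding the definition of $\mathit{In}$ (Eq. (3)), $\mathit{In}_{f,\alpha \lor \beta} = (\mathit{In}_{f,\alpha} \lor \mathit{In}_{f,\beta})$ and $\mathit{In}_{-f,\alpha \lor \beta} = (\mathit{In}_{-f,\alpha} \lor \mathit{In}_{-f,\beta})$:

$$\begin{aligned}
\Leftrightarrow{}& \big(\neg s_1(Z) \lor \neg(\mathit{In}_{f,s_1}(Z) \lor \mathit{In}_{f,s_2}(Z)) \lor \mathit{In}_{f,s_1}(Z)\big) \;\land \\
& \big(s_1(Z) \lor \neg s_2(Z) \lor \neg(\mathit{In}_{-f,s_1}(Z) \lor \mathit{In}_{-f,s_2}(Z)) \lor \neg\mathit{In}_{-f,s_1}(Z)\big)
\end{aligned} \tag{12}$$

Applying the Boolean identities $(\neg(\alpha \lor \beta) \lor \alpha) \Leftrightarrow (\neg\beta \lor \alpha)$ and $(\neg(\alpha \lor \beta) \lor \neg\alpha) \Leftrightarrow \neg\alpha$:

$$\begin{aligned}
\Leftrightarrow{}& \big(\neg s_1(Z) \lor \neg\mathit{In}_{f,s_2}(Z) \lor \mathit{In}_{f,s_1}(Z)\big) \;\land \\
& \big(s_1(Z) \lor \neg s_2(Z) \lor \neg\mathit{In}_{-f,s_1}(Z)\big).
\end{aligned} \tag{13}$$

Note that, while $\mathit{In}$ does not distribute over arbitrary Boolean formulas
(see [12]), when we expand the definition of $\mathit{In}_{f,s_1 \lor s_2}$ (Eq. (12)), the formula
$s_1 \lor s_2$ is in DNF. Thus, Formula (13) is equivalent to the initial Formula (9) of
$\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z)$. We then write the negation of Formula (13) as:

$$\begin{aligned}
\neg\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z) ={}& \big(s_1(Z) \land \mathit{In}_{f,s_2}(Z) \land \neg\mathit{In}_{f,s_1}(Z)\big) \;\lor \\
& \big(\neg s_1(Z) \land s_2(Z) \land \mathit{In}_{-f,s_1}(Z)\big).
\end{aligned} \tag{14}$$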
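
The propositional steps of this derivation can be double-checked mechanically. The sketch below treats $s_1(Z)$, $s_2(Z)$ and the four $\mathit{In}$ sub-formulas as independent Boolean atoms (an over-approximation of the real setting, where these formulas are related) and verifies by truth table that Formula (9), with $\mathit{In}$ already expanded over the disjunction $s_1 \lor s_2$, agrees with Formula (13), and that Formula (14) is its negation. The function and argument names are ours.

```python
from itertools import product


def implies(a, b):
    return (not a) or b


def lzz_eq9(s1, s2, in_f_s1, in_f_s2, in_b_s1, in_b_s2):
    """Formula (9), with In_{f,s1 v s2} expanded as In_{f,s1} or In_{f,s2}.

    The in_b_* atoms stand for the In_{-f,...} (backward) sub-formulas.
    """
    dom = s1 or s2
    return (implies(s1 and dom and (in_f_s1 or in_f_s2), in_f_s1)
            and implies((not s1) and dom and (in_b_s1 or in_b_s2), not in_b_s1))


def lzz_eq13(s1, s2, in_f_s1, in_f_s2, in_b_s1, in_b_s2):
    """Formula (13): the simplified conjunction of two clauses."""
    return (((not s1) or (not in_f_s2) or in_f_s1)
            and (s1 or (not s2) or (not in_b_s1)))


def not_lzz_eq14(s1, s2, in_f_s1, in_f_s2, in_b_s1, in_b_s2):
    """Formula (14): the negation of Formula (13)."""
    return ((s1 and in_f_s2 and not in_f_s1)
            or ((not s1) and s2 and in_b_s1))


def check_all():
    """Compare the three formulas on all 2^6 assignments of the atoms."""
    for v in product([False, True], repeat=6):
        if lzz_eq9(*v) != lzz_eq13(*v):
            return False
        if not_lzz_eq14(*v) == lzz_eq13(*v):
            return False
    return True


if __name__ == "__main__":
    print(check_all())
```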


Linear Encoding of the Semi-Algebraic Transition Relation 


In the following steps, we revise the formula $T_A(X, X', Z)$ that encodes the existence
of the transitions in the abstraction, still enumerating all possible pairs
of states, using the specialized LZZ encoding from Eq. (14). We substitute the
subformula $\neg\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z)$ with the specialized LZZ encoding (Eq. (16)); we
then distribute the conjunction $s_1(X) \land s_2(X')$ over the disjunction present in
the definition of $\neg\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z)$ (Eq. (17)), and then over the possible pairs of
states (Eq. (18)). We rename the two disjuncts in Eq. (18) as $\mathit{InsExpl}_f(X, X', Z)$
and $\mathit{OutExpl}_f(X, X', Z)$ (Eq. (19)). The formulas $\mathit{InsExpl}_f(X, X', Z)$ and
$\mathit{OutExpl}_f(X, X', Z)$ still explicitly enumerate the abstract states. However, each
of these formulas is a conjunction of predicates, applications of the $\mathit{In}_f$ operator
to a conjunction of predicates, and negations of applications of $\mathit{In}_f$.


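$$\begin{aligned}
& \exists Z.\, T_A(X, X', Z) = \\
& \quad \exists Z. \bigvee_{(s_1, s_2) \in S^A \times S^A} \big( s_1(X) \land s_2(X') \land \neg\mathit{LZZ}_{s_1,f,s_1 \lor s_2}(Z) \big) && (15) \\
& \Leftrightarrow \exists Z. \bigvee_{(s_1, s_2) \in S^A \times S^A} \Big( s_1(X) \land s_2(X') \land \big( (s_1(Z) \land \mathit{In}_{f,s_2}(Z) \land \neg\mathit{In}_{f,s_1}(Z)) \lor (\neg s_1(Z) \land s_2(Z) \land \mathit{In}_{-f,s_1}(Z)) \big) \Big) && (16) \\
& \Leftrightarrow \exists Z. \bigvee_{(s_1, s_2) \in S^A \times S^A} \Big( \big(s_1(X) \land s_2(X') \land s_1(Z) \land \mathit{In}_{f,s_2}(Z) \land \neg\mathit{In}_{f,s_1}(Z)\big) \lor \big(s_1(X) \land s_2(X') \land \neg s_1(Z) \land s_2(Z) \land \mathit{In}_{-f,s_1}(Z)\big) \Big) && (17) \\
& \Leftrightarrow \exists Z. \Big( \bigvee_{(s_1, s_2) \in S^A \times S^A} \big(s_1(X) \land s_2(X') \land s_1(Z) \land \mathit{In}_{f,s_2}(Z) \land \neg\mathit{In}_{f,s_1}(Z)\big) \;\lor \bigvee_{(s_1, s_2) \in S^A \times S^A} \big(s_1(X) \land s_2(X') \land \neg s_1(Z) \land s_2(Z) \land \mathit{In}_{-f,s_1}(Z)\big) \Big) && (18) \\
& \Leftrightarrow \exists Z. \big( \mathit{InsExpl}_f(X, X', Z) \lor \mathit{OutExpl}_f(X, X', Z) \big). && (19)
\end{aligned}$$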


We now show how we obtain a formula equivalent to $\mathit{InsExpl}_f(X, X', Z)$ with a linear
size. We expand the definition of the formula $\mathit{InsExpl}_f(X, X', Z)$ with respect
to the predicates in $s_1$ and $s_2$. Recall that each abstract state is a conjunction
of predicates obtained from the set of polynomials $A$ (i.e., $s = \bigwedge_{a \in A} a \bowtie_a 0$, $\bowtie_a \in
\{>, <, =\}$) and that we use $a \bowtie 0 \in s$ to enumerate the predicates in $s$.


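$$\begin{aligned}
\mathit{InsExpl}_f(X, X', Z) = \bigvee_{s_1, s_2 \in S^A} \Big( & \bigwedge_{a \bowtie 0 \in s_1} a(X) \bowtie 0 \;\land \bigwedge_{a \bowtie 0 \in s_2} a(X') \bowtie 0 \;\land \\
& \bigwedge_{a \bowtie 0 \in s_1} a(Z) \bowtie 0 \;\land \bigwedge_{a \bowtie 0 \in s_2} \mathit{In}_{f,a \bowtie 0}(Z) \;\land \\
& \bigvee_{a \bowtie 0 \in s_1} \neg\mathit{In}_{f,a \bowtie 0}(Z) \Big).
\end{aligned} \tag{20}$$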


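In the above formula, we used De Morgan's rules to rewrite the formula
$\neg\bigwedge_{a \bowtie 0 \in s_1} \mathit{In}_{f,a \bowtie 0}(Z)$ as the formula $\bigvee_{a \bowtie 0 \in s_1} \neg\mathit{In}_{f,a \bowtie 0}(Z)$. We express the formula
$\mathit{InsExpl}_f(X, X', Z)$ as an enumeration of the predicates, over the variables
$X$ and $X'$, determining the abstract states $s_1$ and $s_2$, instead of the pairs of
abstract states:

$$\begin{aligned}
\mathit{InsSymb}_f(X, X', Z) ={}& \bigwedge_{a \in A, \bowtie \in \{>,<,=\}} \big(a(X) \bowtie 0 \rightarrow a(Z) \bowtie 0\big) \;\land \\
& \bigwedge_{a \in A, \bowtie \in \{>,<,=\}} \big(a(X') \bowtie 0 \rightarrow \mathit{In}_{f,a \bowtie 0}(Z)\big) \;\land \\
& \bigvee_{a \in A, \bowtie \in \{>,<,=\}} \big(a(X) \bowtie 0 \land \neg\mathit{In}_{f,a \bowtie 0}(Z)\big).
\end{aligned} \tag{21}$$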


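Lemma 1. $\mathit{InsExpl}_f(X, X', Z)$ and $\mathit{InsSymb}_f(X, X', Z)$ are equivalent.


Proof (sketch).

($\Rightarrow$) We show that $\mu \models \mathit{InsExpl}_f(X, X', Z)$ implies $\mu \models \mathit{InsSymb}_f(X, X', Z)$.
Since $\mu \models \mathit{InsExpl}_f(X, X', Z)$, we have that $\mu$ is a model of one of the
disjuncts over the possible pairs of states of $\mathit{InsExpl}_f(X, X', Z)$:

$$\bigwedge_{a \bowtie 0 \in s_1} a(X) \bowtie 0 \;\land \bigwedge_{a \bowtie 0 \in s_2} a(X') \bowtie 0 \;\land \bigwedge_{a \bowtie 0 \in s_1} a(Z) \bowtie 0 \;\land \bigwedge_{a \bowtie 0 \in s_2} \mathit{In}_{f,a \bowtie 0}(Z) \;\land \bigvee_{a \bowtie 0 \in s_1} \neg\mathit{In}_{f,a \bowtie 0}(Z).$$

Hence, there exist two (and exactly two) abstract states $s_1, s_2$ such that
$\mu \models s_1(X)$ and $\mu \models s_2(X')$. This means that any predicate $a \bowtie 0 \notin s_1$ is
such that $\mu \not\models a(X) \bowtie 0$, and similarly for predicates not in the state $s_2$ for
the variables $X'$ (recall that, given a polynomial $a \in A$, the possible abstraction
predicates $a > 0$, $a < 0$, and $a = 0$ are mutually exclusive). We show
that $\mu$ is a model of all the conjuncts in $\mathit{InsSymb}_f(X, X', Z)$. We
have that $\mu \models \bigwedge_{a \in A, \bowtie \in \{>,<,=\}} (a(X) \bowtie 0 \rightarrow a(Z) \bowtie 0)$ since, for all $a \in A$,
$\mu \models a(X) \bowtie 0 \rightarrow a(Z) \bowtie 0$ (when $a \bowtie 0 \in s_1$ we have $\mu \models a(Z) \bowtie 0$,
while when $a \bowtie 0 \notin s_1$ the implication trivially holds). Similarly, this happens
for $\bigwedge_{a \in A, \bowtie \in \{>,<,=\}} (a(X') \bowtie 0 \rightarrow \mathit{In}_{f,a \bowtie 0}(Z))$. We can see the disjunction
$\bigvee_{a \in A, \bowtie \in \{>,<,=\}} (a(X) \bowtie 0 \land \neg\mathit{In}_{f,a \bowtie 0}(Z))$ as:

$$\bigvee_{a \bowtie 0 \in s_1} \big(a(X) \bowtie 0 \land \neg\mathit{In}_{f,a \bowtie 0}(Z)\big) \;\lor \bigvee_{a \bowtie 0 \notin s_1} \big(a(X) \bowtie 0 \land \neg\mathit{In}_{f,a \bowtie 0}(Z)\big).$$

We have that $\mu$ satisfies the first disjunct (and hence the whole disjunction)
because, when $a \bowtie 0 \in s_1$, $\mu \models a(X) \bowtie 0$, and we have that $\mu \models \bigvee_{a \bowtie 0 \in s_1} \neg\mathit{In}_{f,a \bowtie 0}(Z)$.

($\Leftarrow$) We show that $\mu \models \mathit{InsSymb}_f(X, X', Z)$ implies $\mu \models \mathit{InsExpl}_f(X, X', Z)$. As
before, we notice that there are only two abstract states $s_1, s_2$ such that $\mu \models s_1(X)$ and
$\mu \models s_2(X')$, and that all the predicates not in $s_1$ and not in $s_2$ do not hold in $\mu$.
Thus, from $\mu \models \mathit{InsSymb}_f(X, X', Z)$ we have that

$$\mu \models \bigwedge_{a \bowtie 0 \in s_1} a(Z) \bowtie 0 \;\land \bigwedge_{a \bowtie 0 \in s_2} \mathit{In}_{f,a \bowtie 0}(Z) \;\land \bigvee_{a \bowtie 0 \in s_1} \neg\mathit{In}_{f,a \bowtie 0}(Z).$$

Hence, $\mu$ is a model for at least one of the disjuncts in $\mathit{InsExpl}_f(X, X', Z)$.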
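
The equivalence can also be checked by brute force for a small number of polynomials. In the sketch below (our own illustration, not part of the proof) a model is represented abstractly: the tuples `sx`, `sxp`, and `sz` give the sign of each polynomial at $X$, $X'$, and $Z$, and the dictionary `inf` assigns an arbitrary truth value to each $\mathit{In}_{f,a \bowtie 0}(Z)$. Treating the $\mathit{In}$ formulas as free atoms is an assumption that only makes the check stronger.

```python
from itertools import product

SIGNS = (">", "<", "=")


def ins_expl(sx, sxp, sz, inf, m):
    """Formula (20): disjunction over all pairs of abstract states (s1, s2)."""
    for s1 in product(SIGNS, repeat=m):
        for s2 in product(SIGNS, repeat=m):
            if (sx == s1 and sxp == s2 and sz == s1
                    and all(inf[(a, s2[a])] for a in range(m))
                    and any(not inf[(a, s1[a])] for a in range(m))):
                return True
    return False


def ins_symb(sx, sxp, sz, inf, m):
    """Formula (21): one conjunct or disjunct per polynomial and sign."""
    preds = [(a, op) for a in range(m) for op in SIGNS]
    return (all(sx[a] != op or sz[a] == op for a, op in preds)
            and all(sxp[a] != op or inf[(a, op)] for a, op in preds)
            and any(sx[a] == op and not inf[(a, op)] for a, op in preds))


def equivalent(m):
    """Compare both formulas on every abstract model over m polynomials."""
    preds = [(a, op) for a in range(m) for op in SIGNS]
    for sx, sxp, sz in product(product(SIGNS, repeat=m), repeat=3):
        for bits in product([False, True], repeat=len(preds)):
            inf = dict(zip(preds, bits))
            if ins_expl(sx, sxp, sz, inf, m) != ins_symb(sx, sxp, sz, inf, m):
                return False
    return True


if __name__ == "__main__":
    print(equivalent(1), equivalent(2))
```

The explicit formula iterates over $9^m$ pairs of states, whereas the symbolic one has $3m$ conjuncts or disjuncts in each of its three parts.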
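We similarly define the compact encoding of $\mathit{OutExpl}_f(X, X', Z)$:

$$\begin{aligned}
\mathit{OutSymb}_f(X, X', Z) ={}& \bigwedge_{a \in A, \bowtie \in \{>,<,=\}} \big(a(X) \bowtie 0 \rightarrow \mathit{In}_{-f,a \bowtie 0}(Z)\big) \;\land \\
& \bigwedge_{a \in A, \bowtie \in \{>,<,=\}} \big(a(X') \bowtie 0 \rightarrow a(Z) \bowtie 0\big) \;\land \\
& \bigvee_{a \in A, \bowtie \in \{>,<,=\}} \big(a(X) \bowtie 0 \land \neg(a(Z) \bowtie 0)\big).
\end{aligned} \tag{22}$$


Lemma 2. $\mathit{OutExpl}_f(X, X', Z)$ and $\mathit{OutSymb}_f(X, X', Z)$ are equivalent.


Proof. The proof of Lemma 2 is similar to the proof of Lemma 1.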


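We now express the transition relation from Eq. (7) in a compact form:

$$\begin{aligned}
T_{\mathit{Symb}P}(V_P, V_P') = \exists X, X'. \big( & N(X, X') \land H(X) \land H(X') \;\land \\
& H_P(X, V_P) \land H_P(X', V_P') \;\land \\
& \exists Z. (\mathit{InsSymb}_f(X, X', Z) \lor \mathit{OutSymb}_f(X, X', Z)) \big).
\end{aligned} \tag{23}$$


Theorem 2. $T_P(V_P, V_P')$ and $T_{\mathit{Symb}P}(V_P, V_P')$ are equivalent.


Proof. Follows directly from Lemma 1 and Lemma 2.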


Implicit Semi-Algebraic Abstraction 


The formula $T_{\mathit{Symb}P}(V_P, V_P')$ represents the transition relation of the semi-algebraic
abstraction. Computing the finite-state transition system representing
the semi-algebraic abstraction requires eliminating the existential quantifiers
from the initial states, transition relation, and safety property formulas. However,
the above formula $T_{\mathit{Symb}P}(V_P, V_P')$ contains non-linear real arithmetic terms
from the polynomials and the Lie derivatives we compute in $\mathit{In}_f$, so removing the
quantifiers from the formula requires applying a quantifier elimination algorithm
(e.g., Cylindrical Algebraic Decomposition [8]) that does not scale, even when
the number of polynomials is small. Instead, we construct a symbolic transition
system that implicitly encodes the abstraction:

$$S_{\mathit{Impl}P} = \langle X \cup \overline{X} \cup Z,\; \psi(X) \land \mathit{EQ}_P(X, \overline{X}),\; T_{\mathit{Impl}P}(\overline{X}, X', Z) \land \mathit{EQ}_P(X', \overline{X}') \rangle,$$

where

$$T_{\mathit{Impl}P}(X, X', Z) = N(X, X') \land H(X) \land H(X') \land \big(\mathit{InsSymb}_f(X, X', Z) \lor \mathit{OutSymb}_f(X, X', Z)\big).$$

The above encoding is an implicit predicate abstraction [35] that preserves
reachability properties and is such that:


Theorem 3. $S_{\mathit{Impl}P} \models P(X)$ if and only if $S_P \models \neg\neg P_P(V_P)$.


Thus, we can model check the transition system $S_{\mathit{Impl}P} \models P(X)$ to prove that a
property $P$ holds on the dynamical system. Note that, for this purpose, we can
apply standard SMT-based model checking algorithms.

The transition system $S_{\mathit{Impl}P}$ doubles the state space by introducing a copy $\overline{X}$
of the state variables $X$ and encodes the equivalence relation between pairs
of concrete states in $X$ and in $\overline{X}$ with the formula $\mathit{EQ}_P(X, \overline{X})$ (cf. Formula (6)).
The transition relation $T_{\mathit{Impl}P}(X, X', Z)$ then encodes a transition in
the semi-algebraic abstraction with the linear encoding $\mathit{InsSymb}_f(X, X', Z)$ and
$\mathit{OutSymb}_f(X, X', Z)$. In this way, a transition in the transition system $S_{\mathit{Impl}P}$
corresponds to a transition in the semi-algebraic abstraction, and vice versa.


6 Experimental Evaluation 


Research Questions 


We evaluate the performance of our approach (Implicit Abstraction) for the 
verification of invariant properties on the semi-algebraic abstraction of dynamical 
systems. Implicit Abstraction first encodes the semi-algebraic abstraction in a 
transition system (as we show in Sect. 5), and then model checks the invariant 
on the transition system with an off-the-shelf model checker. Our experiments 
aim to answer the following research questions: 

RQ 1: How does Implicit Abstraction compare with the LazyReach algo- 
rithm [32], which explicitly enumerates the reachable states of the abstraction? 
RQ 2: How does Implicit Abstraction compare with the DWCL algorithm [32], 
which applies a divide-and-conquer strategy to reduce the number of polynomials 
in the abstraction? 


Experimental Setup 


We implemented the construction of the implicit abstraction transition system
in Python using PySMT [11] to manipulate formulas, and SymPy [23] for polynomial
manipulation and Gröbner bases computation (i.e., to compute the Lie
derivatives' ranks). We verify the implicit abstraction transition system with the
model checking algorithm for symbolic transition systems with NRA constraints
from [4]. The algorithm abstracts the non-linear transition system into a linear
transition system, which is checked by the algorithm in [6] and is implemented
using the MathSAT [7] SMT solver. We implemented both the LazyReach and
the DWCL algorithms in the same Python tool. Our implementation of DWCL
can use different backends to decide the satisfiability of NRA formulas, namely
MathSAT⁶, the z3 SMT solver [25], or Mathematica [17].
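
To give a flavour of the polynomial manipulation involved, the following sketch computes Lie derivatives using only the Python standard library. It is not the implementation described above (which relies on SymPy): the dictionary-based polynomial representation, the helper names, and the example system $\dot{x} = x^2$, $\dot{y} = -y$ are assumptions made for illustration, and the sketch does not compute ranks.

```python
def p_add(p, q):
    """Sum of two polynomials stored as {exponent tuple: coefficient}."""
    r = dict(p)
    for mono, c in q.items():
        r[mono] = r.get(mono, 0) + c
        if r[mono] == 0:
            del r[mono]
    return r


def p_mul(p, q):
    """Product of two polynomials."""
    r = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            mono = tuple(e1 + e2 for e1, e2 in zip(m1, m2))
            r[mono] = r.get(mono, 0) + c1 * c2
            if r[mono] == 0:
                del r[mono]
    return r


def p_diff(p, i):
    """Partial derivative with respect to the i-th variable."""
    return {m[:i] + (m[i] - 1,) + m[i + 1:]: c * m[i]
            for m, c in p.items() if m[i] > 0}


def lie(p, f):
    """Lie derivative of p along the vector field f (one polynomial per variable)."""
    r = {}
    for i, f_i in enumerate(f):
        r = p_add(r, p_mul(p_diff(p, i), f_i))
    return r


def lie_derivatives(p, f, n):
    """The list [L^(0) p, L^(1) p, ..., L^(n) p]."""
    out = [p]
    for _ in range(n):
        out.append(lie(out[-1], f))
    return out


# Hypothetical system over (x, y): x' = x^2, y' = -y; polynomial a = x*y.
F = ({(2, 0): 1}, {(0, 1): -1})
A_POLY = {(1, 1): 1}

if __name__ == "__main__":
    # First Lie derivative: y * x^2 + x * (-y) = x^2*y - x*y.
    print(lie(A_POLY, F))
```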

We consider 90 invariant verification problems for dynamical systems from
the KeYmaera X theorem prover [10]. These problems are a superset of the
ones used in [32] and are used in the Applied Verification of Continuous and
Hybrid Systems (ARCH) competition [24]. We obtain a total of 180 benchmark
instances using, for each problem, two sets of polynomials for the semi-algebraic
abstraction⁷. The first set contains all the factors of the right-hand side of the
ODEs; the second set extends the first one by also including the Lie derivatives
of the polynomials. The latter set induces an abstraction that is more precise
but also has a larger state space.

⁶ MathSAT uses a different decision procedure [4] than z3 and Mathematica, based on
incremental linearization rather than cylindrical algebraic decomposition.

We evaluate the performance of the algorithms Implicit Abstraction,
LazyReach, and DWCL to solve the above verification problems. The underlying
problem requires deciding the satisfiability of NRA formulas, and the decision
procedures for this problem are efficient for different subsets of problems. For this
reason we further evaluate different configurations of the LazyReach and DWCL
algorithms using three different solvers for NRA formulas (MathSAT, z3, and
Mathematica). Note that, while in principle we could use multiple SMT backends
also in the model checking algorithm [4] and replace the MathSAT SMT
solver with another SMT solver (e.g., z3), this change would not significantly
impact the overall performance, because the algorithm abstracts the non-linear
formulas with linear ones, where both MathSAT and z3 have comparable performance.

We run the Implicit Abstraction, LazyReach, and DWCL algorithms on all
the 180 benchmark instances with a timeout of 100 seconds, and we measure
the execution times to either prove (safe result) or find an abstract counterexample
(unknown result) for each instance. An archive containing the material necessary
to reproduce the experiments is available online at http://www.sergiomover.eu/
cav2021.html.


Results 


RQ 1 - Implicit Abstraction vs. LazyReach. From the cumulative plot in Fig. 2, 
we see that Implicit Abstraction almost always outperforms LazyReach. 

From the cumulative plot in Fig. 2a we see that Implicit Abstraction significantly
outperforms LazyReach on safe instances. For better readability, in the
plot we only show the (virtual) portfolio algorithm Virtual Best LazyReach,
obtained by considering, for each instance, the best run
time among the configurations of LazyReach using different backend
solvers. Virtual Best LazyReach solves a total of 42 safe instances, while Implicit
Abstraction solves 100 safe instances. The scatter plots shown in the first row
of Fig. 3 confirm this observation (note that the safe instances, represented as
blue circles, are mostly in the lower-right triangle of the plot).
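
A virtual portfolio of this kind can be computed from per-configuration run times as in the following sketch. The data layout, the configuration names, and the run times are hypothetical and serve only to illustrate the computation; they are not the measurements of the paper.

```python
def virtual_best(runs):
    """Per-instance minimum run time over several configurations.

    `runs` maps a configuration name to {instance: seconds or None}, where
    None marks a timeout; the result uses the same convention.
    """
    instances = set().union(*(r.keys() for r in runs.values()))
    best = {}
    for inst in instances:
        times = [r[inst] for r in runs.values() if r.get(inst) is not None]
        best[inst] = min(times) if times else None
    return best


def solved(results):
    """Number of instances solved within the timeout."""
    return sum(1 for t in results.values() if t is not None)


# Hypothetical run times in seconds, for illustration only.
RUNS = {
    "LazyReach MathSAT": {"p1": 12.0, "p2": None, "p3": 80.5},
    "LazyReach z3": {"p1": 3.5, "p2": 40.0, "p3": None},
    "LazyReach Mathematica": {"p1": None, "p2": 55.0, "p3": None},
}

if __name__ == "__main__":
    vb = virtual_best(RUNS)
    print(sorted(vb.items()), solved(vb))
```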

Figure 2b shows the cumulative plot when verifying unknown instances. Note
that the total number of unknown instances in the benchmarks is much smaller
than that of the safe ones (combining the results of all the algorithms we have 123
safe instances, 19 unknown instances, and 38 still unsolved instances). From
Fig. 2b, we see that the performance of Implicit Abstraction is comparable with
LazyReach, solving a total of 8 instances and 11 instances, respectively.

⁷ The benchmarks have 321 sign-invariant polynomials (cf. Sect. 2) over a total of
1089 polynomials that DWCL will use to split the state space.


RQ 2 - Implicit Abstraction vs. DWCL. From the cumulative plots in Fig. 2, the
Virtual Best DWCL solves 37 more instances than Implicit Abstraction. However,
we also see from Fig. 2 that the global Virtual Best solves more instances
and is faster than Virtual Best DWCL. In fact, Implicit Abstraction is orthogonal
to DWCL and is comparable to DWCL when fixing either Mathematica or z3
(Implicit Abstraction solves 108 instances, DWCL Mathematica solves 109, and
DWCL z3 solves 114).

The scatter plots in the second row of Fig. 3 compare Implicit Abstraction
with DWCL MathSAT, DWCL Mathematica, and DWCL z3. From these plots,
we see that there are several instances that are solved by only one of the two algorithms
compared in each plot. While we see similar data when comparing Implicit
Abstraction with Virtual Best DWCL (again in the scatter plots of Fig. 3), the
number of instances solved uniquely by Implicit Abstraction seems smaller. We
get a more precise picture of the complementarity of Implicit Abstraction, DWCL
Mathematica, and DWCL z3 from the diagrams in Fig. 4, where we can clearly
see that Implicit Abstraction is orthogonal to both DWCL Mathematica and
DWCL z3. From the diagrams, we see that when using a different backend (i.e.,
Mathematica or z3) DWCL solves a different set of instances. This difference in
performance using Mathematica and z3 is not surprising since Mathematica and
z3 use different algorithms to solve formulas in NRA.

We further notice that Implicit Abstraction uses the MathSAT SMT solver
in the backend, and from our experiments (see again Fig. 3) DWCL MathSAT
performs quite poorly compared to both DWCL Mathematica and DWCL z3.
While naively replacing MathSAT in the model checking algorithm we use [4]
would not provide a significant performance improvement, it is reasonable to
think that investigating a tighter integration with either z3 or Mathematica could
improve the model checking performance. However, we believe this integration
to be beyond the scope of this paper, where we enable the use of symbolic model
checking techniques to analyze the semi-algebraic decomposition.


7 Related Work 


In this work, we focus on the (unbounded time) safety verification problem for
polynomial dynamical systems. This problem is relevant when proving safety for
hybrid programs [27] with KeYmaera X [10] or for hybrid CPS with the HHL
Prover [36]. Our reduction to transition systems may be used as a sub-procedure
in both theorem provers to automate the search for a continuous invariant.
There exist different techniques to prove safety properties for polynomial
dynamical systems (see e.g., [13]): barrier certificates [18,29], first integrals [14],
and Darboux polynomials [15]. All these techniques are orthogonal to semi-algebraic
abstraction, and can be used to find invariant polynomials to restrict
the abstract state space. Pegasus [33] implements all the above techniques, as well as the
LazyReach and DWCL algorithms. Our algorithm can be integrated in Pegasus.
Fig. 2. Total number of solved instances (on the y axis) as a function of the cumulative
time (in seconds, on the x axis) taken by Implicit Abstraction, LazyReach, and
DWCL to solve (a) safe and (b) unknown instances. The comparison includes the
results of LazyReach and DWCL using different backends (MathSAT, z3, and Mathematica), as
well as virtual portfolios combining the best results obtained by a given algorithm when
run with multiple backends. We omit some configurations in (b) to improve readability.


The LZZ [22] procedure was originally proposed to synthesize a continuous
invariant. Instead, we use the LZZ procedure to encode the abstract transition
relation, and then we prove a safety property in the abstraction. We also provide 
a specialized encoding of LZZ to check the existence of abstract transitions. 
The semi-algebraic abstraction [32] is a qualitative abstraction [34,37]. In 
this work, we propose a different algorithm to verify semi-algebraic abstractions 
that allows us to explore the abstract state-space symbolically, in contrast to 
the LazyReach algorithm [32]. In principle, our technique is orthogonal to the 
DWCL algorithm [32], since we could replace LazyReach, which is used in DWCL 
as a sub-routine, with our approach (i.e., model check the implicit abstraction). 
Relational abstraction [31] abstracts the dynamical system’s trajectories with 
a discrete transition relation, reducing the verification problem on the continuous 
system to a verification problem on the discrete system. The implicit encoding of 
the semi-algebraic abstraction can be seen as an instance of relational abstraction,
where a trajectory of the dynamical system is mapped to a sequence of
abstract transitions (similarly to what happens with relational abstractions for
time-sampled systems in [2,38]). Since relational abstractions can be composed
with each other (e.g., see [26]), we can strengthen the implicit semi-algebraic
abstraction encoding with a relational abstraction. This composition is useful
when the semi-algebraic abstraction cannot easily capture the system's
behavior (e.g., a precise relation of the time elapsed in a transition [26]).
Fig. 3. Scatter plots comparing the run time (in seconds) of Implicit Abstraction (on
the y axis) with LazyReach (first row, on the x axis) and DWCL (second row, on the x
axis). Blue circles represent safe verification problems. Red crosses are instances where
the algorithm found an abstract counterexample. When Implicit Abstraction runs for
more than the 100 s timeout, we plot the instance on the horizontal line marked as to,
and similarly for LazyReach and DWCL on the vertical line.
Fig. 4. Diagrams representing the distribution of unique instances solved combining
different algorithms (DWCL Mathematica, DWCL z3, and Implicit Abstraction). Each 
set, displayed as a dotted circle enclosed by a dotted line, represents the set of instances 
solved with one algorithm. The number shown in each partition is the number of 
instances solved uniquely by the sets forming the partition. For example, the central 
partition (i.e., the intersection of all the sets) of the diagram (a) shows that DWCL 
Mathematica, DWCL z3, and Implicit Abstraction solved the same set of 141 instances. 


Predicate abstraction [16] is a commonly used abstraction technique to verify
infinite-state systems. Several symbolic techniques [3, 19, 20] focus on the efficient
computation of the predicate abstraction. In principle, we can also use those
techniques to explicitly compute the semi-algebraic abstraction. However, the
up-front, explicit computation of the abstraction is a bottleneck and can be
avoided with implicit predicate abstraction [35] when the goal is to verify a
safety property on the abstract system. We use implicit abstraction to obtain
an implicit encoding of the semi-algebraic abstraction. The transition system of
the semi-algebraic abstraction contains NRA formulas (the polynomials can be
non-linear or the Lie derivatives of the polynomials are non-linear). While there
are few algorithms and tools that can verify such transition systems (e.g., [4]),
our technique is agnostic to the underlying model checking algorithm.


8 Conclusions and Future Work 


In this paper, we addressed the safety problem of polynomial dynamical systems.
We built on the LZZ algorithm to define a symbolic encoding of the abstraction
based on a set of polynomials. The encoding is linear in the number of polynomials
and can be used to implicitly represent the abstraction without the need to
enumerate the abstract states, enabling the use of SMT-based model checking
techniques. The experimental evaluation showed that the approach is promising
and complementary to existing techniques, solving a number of new instances.
The main directions for future work are, on one side, refining the abstraction by
discovering new polynomials that are able to remove spurious abstract counterexamples,
and, on the other side, the application of the approach to hybrid systems
where the continuous dynamics depends on the discrete state of the system.
Abstract. Real-time systems are notoriously hard to verify due to nondeterminism,
concurrency and timing constraints. When timing constants
are uncertain (in the early design phase, or due to slight variations
of the timing bounds), timed model checking techniques may not
be satisfactory. In contrast, parametric timed model checking synthesizes
timing values ensuring correctness. IMITATOR takes as input an
extension of parametric timed automata (PTAs), a powerful formalism 
to formally verify critical real-time systems. IMITATOR extends PTAs 
with multi-rate clocks, global rational-valued variables and a set of addi- 
tional useful features. We describe here the new features and algorithms 
offered by IMITATOR 3, that moved along the years from a simple proto- 
type dedicated to robustness analysis to a standalone parametric model 
checker for timed systems. 
1 Introduction


Real-time systems are often used in critical environments, and may be verified 
using formal methods. Such systems are notoriously hard to verify due to nonde- 
terminism, concurrency and timing constraints. Timed model checking provides 
designers with techniques to formally verify a real-time system. However, timed 
model checking may not always be fully satisfactory: First, in the early design 
phase, timing constants may not be known and, without them, model checking 
is not possible; Second, at runtime, timing constants may vary (due to uncertain 
bounds, or to processor clock drifts), in which case the model checking result 
may not hold anymore. In contrast, parametric timed model checking synthesizes 
timing values ensuring the system correctness. 

Parametric timed automata (PTAs) are a powerful formalism to reason 
about, and formally verify critical real-time systems [5]. PTAs are finite state 
Fig. 1. Examples of graphical outputs


automata extended with clocks, i.e., real-valued variables evolving linearly, that 
can be compared with either integer constants or parameters in guards (con- 
straints to take a transition) and invariants (constraints to remain in a location). 
IMITATOR takes as input networks of “IMITATOR PTAs” (IPTAs) extending 
PTAs with several convenient features such as stopwatches, multi-rate clocks or 
global shared rational-valued variables. 
IMITATOR answers variants of the following problem: 


Parameter synthesis problem:
INPUT: A network of IPTAs A and a specification $\varphi$
PROBLEM: Synthesize the set of parameter valuations for which A satisfies $\varphi$


IMITATOR answers this problem by synthesizing sets of parameter valuations 
in the form of a finite disjunction of linear constraints over the parameters. 

IMITATOR is a command-line-only tool; its input is text-based (partially
inspired by the HyTECH syntax [41]) and is “human-readable”, unlike, e.g.,
XML. IMITATOR produces standardized result files (that can be parsed
by external tools), and can produce graphical outputs, such as in Fig. 1.

The expressive power (i.e., the ease of writing a complicated model in a compact
manner) of the tool has been largely improved since IMITATOR 2.5 [17], and
IMITATOR is now a parametric timed model checker taking as inputs a model
and a property, and implementing various synthesis algorithms.


2 An Expressive Input Language 


Parametric Timed Automata (PTAs). Timed automata (TAs) [3] extend finite- 
state automata with clocks, i.e., real-valued variables evolving at the same rate 1, 
that can be compared to integers along edges (“guards”) or within locations
(“invariants”). Clocks can be reset (to 0) along transitions. PTAs extend TAs 
with (timing) parameters, i.e., unknown rational-valued constants [5]. 


Example 1. In the model in Fig. 2 (that goes beyond the syntax of PTAs, see
Example 2), there are four locations, depicted as rounded rectangles. Invariants
are depicted using dotted rectangles. In the invariant of location working, clock x
is compared to parameter $p_{total}$. The guard of the transition from coffee to
working compares clock t to $p_{coffee}$; this clock t is reset to 0 along this
transition.


IMITATOR Parametric Timed Automata (IPTAs). IMITATOR takes as input 
models described as networks of IMITATOR parametric timed automata 
(IPTAs). IPTAs extend PTAs with a set of useful features, described in the 
following. 


Global Rational-Valued Variables. Global variables (called “discrete”) can be 
defined, and are part of the discrete part of a state, together with locations 
(and different from clocks and parameters that are part of the continuous part). 
Global variables in IMITATOR are exact rationals, following exact arithmetic
(as opposed to, e.g., floating-point arithmetic that can accumulate errors and
lead to faulty assertions). Exact rationals are encoded in IMITATOR using the
GNU MP library. Such discrete variables can be updated along transitions, and
can also be part of the clock guards and invariants; in fact, virtually any linear
expression over clocks, parameters and discrete variables can be used in guards,
invariants and updates. Non-linear arithmetic expressions over discrete
variables alone are allowed too.
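
The difference between exact and floating-point arithmetic can be illustrated
with a minimal sketch. It uses Python's standard fractions module as a stand-in
for exact rationals; it is not IMITATOR code, and the helper name accumulate is
hypothetical.

```python
from fractions import Fraction

def accumulate(step, times):
    """Add `step` to a running total `times` times (hypothetical helper)."""
    total = step * 0          # zero of the same numeric type as `step`
    for _ in range(times):
        total += step
    return total

# Binary floating point cannot represent 1/10 exactly, so rounding errors pile up.
float_total = accumulate(0.1, 10)
# An exact rational keeps the value 1/10 exactly, so the sum is exactly 1.
exact_total = accumulate(Fraction(1, 10), 10)

print(float_total == 1.0)   # False
print(exact_total == 1)     # True
```

A guard such as "total = 1" would therefore be evaluated differently in the two
representations, which is the kind of faulty assertion exact rationals avoid.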


Automata Synchronization. IPTAs can be synchronized together on shared 
actions, or by reading shared variables. All variables (clocks, parameters, dis- 
crete) are potentially global in IMITATOR. This allows users to define models 
component by component. 


Arbitrary Flows. Since version 3.0, IMITATOR supports arbitrary (constant)
flows for clocks; this way, clocks do not necessarily evolve at the same rate, and
can encode concepts other than time: temperature, amount of completion,
continuous cost, etc. Their value can increase or decrease at any predefined rate
in each location, and can become negative. In that sense, IMITATOR’s clocks are
closer to continuous variables (as in hybrid automata) than to TAs’ clocks;
nevertheless, we keep the name clock for the sake of backward compatibility. This
makes IMITATOR support a parametric extension of multi-rate automata [2].
This notably includes stopwatches, where clocks can have a 1- or 0-rate [36].
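
As an illustration of constant flows, the following sketch computes clock values
after a delay in a location. It is a simplified concrete-valued model, not the
symbolic computation performed by the tool; the function name elapse and the
dictionary encoding are assumptions made for this example.

```python
from fractions import Fraction

def elapse(clocks, flows, delay):
    """Return the clock values after `delay` time units in one location.

    `flows` maps a clock name to its constant rate in that location;
    a clock that is not listed evolves at the default rate 1.
    """
    return {name: value + flows.get(name, Fraction(1)) * delay
            for name, value in clocks.items()}

start = {"x": Fraction(0), "t": Fraction(0)}
# Rate 2 for x (as in a location with flow x' = 2), default rate 1 for t.
after_fast = elapse(start, {"x": Fraction(2)}, Fraction(3, 2))
# Rate 0 for x (a stopwatch), default rate 1 for t.
after_stop = elapse(after_fast, {"x": Fraction(0)}, Fraction(1))

print(after_fast)   # x = 3, t = 3/2
print(after_stop)   # x = 3, t = 5/2
```

A negative rate works the same way, so a clock value can decrease and become
negative.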


Fig. 2. An IPTA example: writing papers and drinking coffee


Additional Syntax Improvements. Beyond the aforementioned increase of the
syntactic expressive power, the syntax was enhanced with accepting locations
(that can be used in properties), global constants, “if... then... else”
conditions in updates, and with the ability to include model fragments from
different files (new syntax #include(modelpart.imi)). Several simplifications
were made to the syntax to keep it “human-readable”. For example, location
workingFast of Fig. 2 is written in IMITATOR syntax as follows:


loc workingFast: invariant x <= pTotal flow{x' = 2}


Translations. Finally, translations of the model are available to other model
checkers such as HyTECH [41] and UPPAAL [42]. In both cases, not all features
can be translated, since some of the features of IMITATOR do not exist in the
target tool; e.g., UPPAAL supports neither parameters nor complex linear
constraints over clocks (only “diagonal” constraints). Graphical translations of
the model are also available to JPEG, PDF and LaTeX formats.


Example 2. Consider the IPTA in Fig. 2, modeling a researcher writing papers.
The model features two clocks t (measuring the time when needing a coffee)
and x (measuring the amount of work done on a given paper), both initially 0.
Their rate is always 1, unless otherwise specified (e.g., in workingFast). Initially,
the researcher is working (location working) on a paper, requiring an amount of
work $p_{total}$. When the paper is completed (guard $x = p_{total}$), the IPTA
moves to location finished. From there, at any time, the researcher can start
working on a new paper (transition back to working, updating x and t).

Alternatively, after at least a certain time (guard $t \geq p_{need}$), the
researcher may need a coffee; this action can only be taken until a maximum
number of coffees have been drunk for this paper ($nb \leq max - 1$), where nb is
a discrete global variable recording the number of coffees drunk while working on
the current paper. When drinking a coffee (location coffee), the work is obviously
not progressing ($\dot{x} = 0$). Drinking a coffee takes exactly $p_{coffee}$ time
units (guard $t = p_{coffee}$ back to location working). Observe that, from the
second paper onwards (transition labeled with restart), the researcher is already
halfway to her/his need for a coffee (parametric update
$t \leftarrow 0.5 \times p_{need}$ [22]).

Also, whenever 80% of the work is done (guard $x \geq 0.8 \times p_{total}$), the
researcher may work twice as fast (location workingFast, with a rate 2 for clock x).
In that case, (s)he needs a coffee faster too ($0.6 \times p_{need}$).

All three durations $p_{coffee}$, $p_{need}$ and $p_{total}$ are timing parameters.
We fix their parameter domains as follows: $p_{coffee}, p_{total} \in [0, \infty)$
and $p_{need} \in [1, \infty)$. The maximum number of coffees
$max \in [0, \infty)$ is also a parameter; observe that it is (only) compared to the
discrete variable nb, and therefore can be seen as a “discrete parameter”, which is
allowed by the liberal syntax of IMITATOR.


The example in Fig. 2 could not be modeled with UPPAAL due to the presence
of timing parameters, stopwatches, multi-rate clocks and non-0 updates. It may be
modeled using HyTECH; however, most algorithms implemented in IMITATOR
(even the most basic ones, such as liveness synthesis) do not exist in HyTECH,
as HyTECH mainly focuses on basic state space computation.


3 A Variety of Synthesis Algorithms 


The formalism of networks of IPTAs is “highly undecidable” for most problems.
Indeed, while several problems are decidable for timed automata (notably
reachability [3]), most interesting problems become undecidable in the presence
of timing parameters [5,8], notably when such parameters are unbounded [35].
On top of this, multi-rate automata together with linear constraints over the
clocks also yield undecidability [2]. Finally, the mere use of stopwatches, even
without the aforementioned extensions, brings undecidability [36]. Also note
that, in contrast to several existing model checkers, IMITATOR offers the use
of unbounded rational variables, therefore with an infinite domain. For all these
reasons, it is always possible to find examples of IPTAs for which the algorithms
implemented in IMITATOR would not terminate with an exact (sound and
complete) result. The rationale behind IMITATOR is therefore to follow a
“best-effort” approach, by:


— using aggressive optimizations and abstractions (e.g., [11,19,45]), leading to 
termination for most case studies in practice; 

— outputting over- or under-approximated results, i.e., the set of synthesized 
parameter valuations may be larger or smaller than the exact result. 


IMITATOR outputs a standardized result (in a text file) that contains the
synthesized constraint together with additional information, notably the validity
of the constraint, i.e., whether the set of valuations is exact (sound and complete),
possibly over-approximated, possibly under-approximated, or potentially invalid,
i.e., when both under-approximating and over-approximating heuristics were used.
By default, IMITATOR attempts to synthesize an exact result; approximations may
be used only when some specific options are used (e.g., a limit on the number of
states explored, or on the computation time). These approximations are
conservative for most algorithms; for example, if an approximation is used for
safety synthesis, then the result will be under-approximated (i.e., the system is
safe for all synthesized valuations, even though some more safe valuations may
exist).
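
The four validity statuses can be summarized by the following sketch. The enum
and function names are hypothetical and do not reflect the vocabulary of the
actual result files; the sketch only encodes the case distinction described
above.

```python
from enum import Enum

class Validity(Enum):
    EXACT = "exact (sound and complete)"
    OVER = "possibly over-approximated"
    UNDER = "possibly under-approximated"
    INVALID = "potentially invalid"

def validity(used_over_approx: bool, used_under_approx: bool) -> Validity:
    """Classify a synthesized constraint from the kinds of heuristics used."""
    if used_over_approx and used_under_approx:
        # Valuations may have been both wrongly added and wrongly dropped.
        return Validity.INVALID
    if used_over_approx:
        return Validity.OVER
    if used_under_approx:
        return Validity.UNDER
    return Validity.EXACT

print(validity(False, True).value)   # possibly under-approximated
```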

IMITATOR offers two main classes of synthesis: i) Witness (or counter- 
example), which attempts to exhibit at least one parameter valuation satisfying 
the property; often, IMITATOR still outputs a symbolic set of valuations (i.e., 
a linear constraint over the parameters), but stops the analysis as soon as one 
such set is found. ii) Normal synthesis, where IMITATOR attempts to synthesize 
all parameter valuations satisfying the property. 

Properties include reachability (denoted by “EF”, following the TCTL syn- 
tax), safety (denoted by “AGnot”), liveness, deadlock-freeness, robustness, and 
some others. 

Throughout this section, we exemplify the main synthesis algorithms of
IMITATOR on Example 2.¹ All the results synthesized in the following are exact
(sound and complete), unless otherwise specified.


Safety. A first algorithm of IMITATOR is safety synthesis, i.e., synthesizing
parameter valuations for which a discrete state (location and/or valuation of
the discrete variables) is unreachable for all runs. For example, one can synthesize
the valuations for which it is impossible to drink any coffee, i.e., it is impossible
to reach the coffee location of the “researcher” automaton of Fig. 2.


#synth AGnot(loc[researcher] = coffee)


The result is: $max \in [0, 1) \lor (max \geq 1 \land p_{total} < \frac{p_{need}}{10})$.

Let us explain this result. The first disjunct is trivial: if the researcher is not
allowed to drink any coffee ($max < 1$), the transition to coffee (guarded by
“$nb \leq max - 1$”) can never be taken. The second disjunct is, despite the relative
simplicity of this model, less trivial: assume for illustration that $p_{need} = 10$
and $p_{total} = 1$, and let us show that the researcher is still able to start
drinking a coffee in this situation. After the first paper completion (action
restart), we have $x = 0$ and $t = 5$. After one time unit in location working
($x = 1$ and $t = 6$), the researcher moves to workingFast, and can immediately
move to coffee (guard $t \geq 0.6 \times p_{need}$ is now satisfied). This scenario,
which can be seen on the parametric state space output by IMITATOR (see Fig. 1a),
is also possible for larger values of $p_{total}$. This explains the strict
inequality $p_{total} < \frac{p_{need}}{10}$.
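
The scenario used in this explanation can be replayed on concrete values. The
sketch below is not produced by the tool: it follows one specific run only, reads
the guards of Fig. 2 as non-strict inequalities, and assumes that restart resets
nb to 0; the function name is hypothetical.

```python
from fractions import Fraction as F

def coffee_reachable_after_restart(p_need, p_total, max_coffees):
    """Replay one run: restart, wait in working, then try to drink from workingFast."""
    # Assumed effect of restart: x <- 0, t <- 0.5 * p_need, nb <- 0.
    x, t, nb = F(0), F(1, 2) * p_need, 0
    # Wait in working as long as its invariant x <= p_total allows (both rates are 1).
    delay = p_total - x
    x, t = x + delay, t + delay
    # Guard to enter workingFast, read here as x >= 0.8 * p_total.
    can_speed_up = x >= F(8, 10) * p_total
    # Guard to drink from workingFast, read here as t >= 0.6 * p_need and nb <= max - 1.
    can_drink = t >= F(6, 10) * p_need and nb <= max_coffees - 1
    return can_speed_up and can_drink

print(coffee_reachable_after_restart(F(10), F(1), F(1)))       # True
print(coffee_reachable_after_restart(F(10), F(9, 10), F(1)))   # False
```

With $p_{need} = 10$, the run succeeds for $p_{total} = 1$ and fails for
$p_{total} = 0.9$, on either side of the bound $\frac{p_{need}}{10}$.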


1 All finishing executions for our example using IMITATOR 3.0 “Cheese” ea560fd on a 
Dell XPS 13 7390 Intel® Core™ i7-10510U CPU @ 1.80 GHz running Linux Mint 
20 Ulyana terminate within < 1s. All examples and results can be found at [9]. 
Reachability. Reachability can be seen as the opposite of safety, i.e., the goal is
to synthesize parameter valuations for which a discrete state is reachable for at
least one run. For example, one can ask for the valuations for which it is possible
to drink at least one coffee:


#synth EF(loc[researcher] = coffee)


The result is $max \geq 1 \land p_{total} \geq \frac{p_{need}}{10}$, which is
obviously the complement of the result synthesized for the aforementioned safety
property.
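
The complementarity of the two results can be checked on sample valuations. The
sketch below reads the safety result as $max \in [0,1) \lor (max \geq 1 \land
p_{total} < p_{need}/10)$ and the reachability result as $max \geq 1 \land
p_{total} \geq p_{need}/10$; the function names and the sampling grid are choices
made for this example, and sampling is of course not a proof.

```python
from fractions import Fraction as F
from itertools import product

def safe(mx, p_total, p_need):
    """Safety constraint: the coffee location is unreachable."""
    return (0 <= mx < 1) or (mx >= 1 and p_total < p_need / 10)

def reachable(mx, p_total, p_need):
    """Reachability constraint: the coffee location is reachable."""
    return mx >= 1 and p_total >= p_need / 10

# Sample the parameter domains: max, p_total in [0, 3], p_need in [1, 4].
grid = [F(n, 4) for n in range(13)]
needs = [1 + g for g in grid]

# Complementary means: exactly one of the two constraints holds at every point.
complementary = all(safe(m, pt, pn) != reachable(m, pt, pn)
                    for m, pt, pn in product(grid, grid, needs))
print(complementary)   # True
```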

One can also synthesize valuations for which it is possible to drink at least
five coffees while working on some article (i.e., $nb \geq 5$).


#synth EF(loc[researcher] = coffee & nb >= 5) 


The result is $max \geq 5 \land p_{total} \geq \frac{41}{10} \times p_{need}$.


Minimum-Time Reachability. Minimal-time synthesis [12] aims at synthesizing 
parameter valuations minimizing the time needed to reach a discrete state. Here, 
we can ask for the valuations for which it is possible to finish an article after 
drinking at least 2 coffees: 


#synth EFtmin(loc[researcher] = finished & nb >= 2) 


The result is $[\ldots] + p_{need} + 2 \times p_{coffee} \leq 2 \land max \geq 2$
and the minimal time is 2. That is, any of these valuations guarantees the
reachability of a state where the researcher has drunk 2 coffees, and the minimum
time is 2 (recall that $p_{need} \in [1, \infty)$).


Optimal Parameter Reachability. One can ask here for the valuations for which
the value of a given parameter is minimized or maximized when reaching a given
state. Let us ask for the valuations minimizing the value of $p_{total}$ when
finishing a paper after drinking (at least) 3 coffees.


#synth EFpmin(loc[researcher] = finished & nb >= 3, pTotal)


The result is $max \geq 3 \land p_{total} = 2.1 \land p_{need} = 1$. Observe that
$p_{coffee}$ is not involved in this constraint (contrary to minimum-time
synthesis); indeed, the time spent in drinking coffee does not impact the total
duration of the work ($p_{total}$), as the progress of clock x is stopped in coffee.


Parametric Deadlock Freeness. Deadlocks are states in which no discrete action 
can be taken, and time cannot elapse (“timelock”). Such situations may denote 
ill-formed models. IMITATOR offers an algorithm [7] synthesizing parameter val- 
uations for which the model is deadlock-free. In case of “early termination” 
(predefined bound on the depth of the state space or on the computation time), 
a backward procedure synthesizes a subset of correct (deadlock-free) valuations. 


#synth DeadlockFree 
For this property, the analysis does not terminate, as the state space is infinite
(unbounded rational-valued parameters, unbounded variable nb) and IMITATOR
needs to explore it as a whole to deduce deadlock-freeness for our example.

Adding a bound on the depth of the state space (option -depth-limit 40)
yields termination, with a pair of constraints: an under-approximated positive
constraint (i.e., valuations that are guaranteed to be deadlock-free)
$max < 16 \lor (max \geq 16 \land p_{total} < [\ldots] \times p_{need})$, and an
over-approximated negative constraint (i.e., valuations that might be deadlocked)
$max \geq 16 \land p_{total} \geq [\ldots] \times p_{need}$. Observe
that both constraints are complementary, i.e., IMITATOR is sure that the former
set is deadlock-free, and is not sure that the latter set contains deadlocks. (Note
that, in fact, the model is very likely to be deadlock-free for all valuations, even
though IMITATOR is not able to show it.)


Liveness Synthesis. A new feature of IMITATOR 3 is cycle synthesis, i.e., the
synthesis of parameter valuations for which there exists an infinite run, possibly
passing infinitely often through a given discrete state (Büchi condition). IMITATOR
uses by default an original algorithm by Laure Petrucci and Jaco van de Pol based
on NDFS extended with parametric subsumption and pruning [45] (other algorithms,
such as BFS, are also available [11]). In our running example, one can ask for the
valuations for which the researcher infinitely often writes papers after drinking
(at least) 3 coffees for each of them.


#synth CycleThrough(loc[researcher] = finished & nb>=3)


The result is $max \geq 3 \land p_{total} \geq 2.1 \times p_{need}$.
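
For readers unfamiliar with NDFS (nested depth-first search), the following
sketch shows the classical, non-parametric algorithm on a finite graph. It is
not the algorithm of [45]: parametric zones, subsumption and pruning are
omitted, and the graph encoding and function name are assumptions made for this
example.

```python
def has_accepting_cycle(graph, init, accepting):
    """Classical nested DFS: is some accepting state, reachable from init, on a cycle?"""
    visited_outer, visited_inner = set(), set()

    def inner(state, seed):
        # Nested search: look for a path leading back to the accepting seed.
        for succ in graph.get(state, ()):
            if succ == seed:
                return True
            if succ not in visited_inner:
                visited_inner.add(succ)
                if inner(succ, seed):
                    return True
        return False

    def outer(state):
        visited_outer.add(state)
        for succ in graph.get(state, ()):
            if succ not in visited_outer and outer(succ):
                return True
        # Post-order: start the nested search once all successors are explored.
        return state in accepting and inner(state, state)

    return outer(init)

# Discrete skeleton of the researcher automaton (timing and parameters ignored).
skeleton = {"working": ["finished", "coffee"],
            "coffee": ["working"],
            "finished": ["working"]}
print(has_accepting_cycle(skeleton, "working", {"finished"}))   # True
```

On the purely discrete skeleton a cycle through finished always exists; the
parametric algorithm additionally computes for which valuations such a cycle is
feasible once guards and invariants are taken into account.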


Robustness. Inherited from earlier versions of IMITATOR, one can apply the 
inverse method [29] (also called trace preservation [21]) that, given a reference 
parameter valuation, synthesizes the set of parameter valuations for which the 
set of “traces” (discrete behaviors, i.e., abstracting time information away) is 
the same as for this reference valuation. 


#synth TracePreservation(pTotal = 10, pNeed = 5, pCoffee = 3, max = 3) 


The result is: $(3 \times p_{need} > p_{total} \geq 2 \times p_{need} \land max \in [2, 3))
\lor (2.1 \times p_{need} > p_{total} \geq 2 \times p_{need} \land max \geq 3)$.
The synthesized constraint can be seen as a characterization of the robustness of
the original parameter valuation.


Synthesis Using Patterns. Another way to specify properties is to use a set of pre- 
defined observer patterns [6,28]. Observer patterns are translated into observer 
automata (called reachability testing in [1]), and their correctness reduces to 
reachability. This procedure is transparent to the user, i.e., (s)he only needs to 
specify the pattern and IMITATOR takes care of the translation and synthesis. 
IMITATOR patterns specify the order between actions, extended with (possibly 
parametric) timing information. The syntax is detailed in the user manual, and 
the semantics is given in [6]. 

For example, one can synthesize the set of valuations such that, every time 
the researcher restarts a new article, (s)he completes it within 5 time units. That 
is, every occurrence of the restart action must be followed within (at most) 5
time units by the done action. 


#synth pattern(everytime restart then eventually done within 5) 


A part of the valuations set is:
$max \geq 6 \land 5 - 6 \times p_{coffee} = p_{total} \geq 4.7 \times p_{need}$.
A graphical 2D representation projected onto $p_{total}$ and $p_{coffee}$ (setting
$p_{need} = 2$ and $max = 3$) is given in Fig. 1b.
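
The intuition behind this pattern can be sketched as a check on a single finite
timed trace. This is only an illustration of the informal reading given above
(the precise semantics is in [6], and the tool works symbolically through an
observer automaton); the function name and the trace encoding are assumptions.

```python
def satisfies_everytime_within(trace, trigger, response, bound):
    """Check "everytime trigger then eventually response within bound" on a finite trace.

    `trace` is a list of (timestamp, action) pairs sorted by timestamp.
    A trigger with no later response in the trace counts as a violation here.
    """
    for i, (t_trigger, action) in enumerate(trace):
        if action != trigger:
            continue
        answered = any(later == response and t_later - t_trigger <= bound
                       for t_later, later in trace[i + 1:])
        if not answered:
            return False
    return True

ok_trace = [(0, "done"), (1, "restart"), (6, "done")]
bad_trace = [(1, "restart"), (5, "done"), (6, "restart"), (12, "done")]
print(satisfies_everytime_within(ok_trace, "restart", "done", 5))    # True
print(satisfies_everytime_within(bad_trace, "restart", "done", 5))   # False
```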


Other Algorithms. IMITATOR features a number of additional algorithms, 
including i) non-Zeno infinite run synthesis [27], ii) behavioral cartography [16] 
that partitions the parameter space into tiles where the discrete behavior is uni- 
form, or iii) parametric reachability preservation, that takes as input a discrete 
state and a reference valuation, and synthesizes valuations for which this dis- 
crete state is reachable iff it is reachable for the reference valuation [25]. The two 
latter algorithms can be distributed over a cluster, showing interesting results, 
and can be used to perform reachability synthesis while being faster than the 
normal reachability synthesis algorithm for some benchmarks [14,15]. Finally, 
compositional verification for a subclass of IPTAs (a parametric extension of 
event-recording automata [4]) was proposed in [24]. 


4 Distribution 


IMITATOR is distributed under the terms of the GNU General Public License.
Its source code is therefore publicly available, and benefited from several
contributors’ additions. IMITATOR is available online², together with its
documentation, and a benchmarks library [26].

IMITATOR depends on several libraries. Notably, the core engine relies on
the Parma Polyhedra Library (PPL) [32] for the computation of symbolic states.
As a consequence, IMITATOR can be cumbersome to compile. For this reason,
standalone binaries are available for all Linux-like systems. A Docker version³
(made by Jaime Arias) and a prototype Web service⁴ are available too.

An extensive user manual, explaining all algorithms and providing users with 
a full description of the input syntax for models and properties, is available [10]. 


5 A Selection of Applications 


IMITATOR has been applied to a variety of both academic and industrial case
studies over the last few years. These applications span several domains,
including real-time systems, testing and monitoring, cybersecurity, and hardware
verification. One can cite:


2 https://www.imitator.fr.
3 https://hub.docker.com/r/imitator/imitator/.
4 https://imitator.lipn.univ-paris13.fr/.

— the parametric verification of an asynchronous memory circuit by
ST-Microelectronics (from a model described in [37]),

— verification of parametric scheduling problems by Astrium Space Transporta- 
tion [40] and ArianeGroup SAS [13], 

— analysis of music scores [38], 

— verifying the multi-processor image processing system of an unmanned aerial 
aircraft with uncertain periods, as a benchmark made public by Thales [46], 

— parametric pattern matching and monitoring of logs from the automotive
industry [20],

— synthesis of timing/cost parameters in attack-fault trees [23,31], 

— testing product lines using parametric constraints [44], 

— verification of an industrial asynchronous leader election algorithm by Thales 
using IMITATOR combined with abstractions [18], 

— performing parametric opacity analyses for timed automata [30], and 

— synthesis of parameter valuations guaranteeing liveness properties for the 
Bounded Retransmission Protocol [11]. 


6 Related Tools 


HyTECH [41] was the first model checker for hybrid systems (a class of
formalisms beyond PTAs), including parameters; it is not maintained anymore.

UPPAAL [42] is a state-of-the-art tool for modeling and verifying systems
modeled as networks of timed automata and extended with variables and
data structures; while UPPAAL became a major tool for model checking timed
automata, it does not support parametric verification, and the use of clocks is
restricted to comparing one clock with one constant or with another clock, while
IMITATOR allows a liberal syntax based on polyhedra.

Roméo [43] performs parameter synthesis for parametric time Petri nets
with inhibitor arcs [47]. While Roméo shares similarities with IMITATOR, it does
not support (extensions of) timed automata, and notably not multi-rate clocks.

SpaceEx [39] is a tool for verifying hybrid systems. It is not specifically
dedicated to parameter synthesis, and mainly targets safety and reachability, in
contrast to IMITATOR that proposes multiple synthesis algorithms.

IMITATOR’s input syntax also shares some similarities with that of
PHAVerLite [33] (a fork of PHAVer, the predecessor of SpaceEx; the fork uses
PPLite [34] instead of PPL [32]), coming from the fact that both IMITATOR and
PHAVerLite originate from the HyTECH syntax.


7 Perspectives 


To gain some further speed for models that require less expressiveness (notably 
no strict inequality nor rational-valued variables), offering to replace PPL [32] 
with PPLite [34], or using standard 32-bit integers instead of GNU MP rationals 
is on our agenda. 
Abstract. In this paper, we investigate the design of a safe hybrid
controller for an aircraft that switches between a classical linear quadratic
regulator (LQR) controller and a more intelligent artificial neural network
(ANN) controller. Our objective is to switch safely between the
controllers, such that the aircraft is always recoverable within a fixed
amount of time while allowing the maximum time of operation for the
ANN controller. There is an a priori known safety zone for the LQR
controller operation in which the aircraft never stalls, over-accelerates, or
exceeds maximum structural loading, and hence, by switching to the
LQR controller just before exiting this zone, one can guarantee safety.
However, this a priori known safety zone is conservative, and therefore
limits the time of operation for the ANN controller. We apply
reachability analysis to expand the known safety zone, such that the LQR
controller will always be able to drive the aircraft back to the safe zone
from the expanded zone (“recoverable zone”) within a fixed duration.
The “recoverable zone” extends the time of operation of the ANN
controller. We perform simulations using the hybrid controller corresponding
to the recoverable zone and observe that the design is indeed safe.


1 Introduction 


Different types of controller designs have been investigated for aircraft control, 
such as Linear Quadratic Regulators [28], Fuzzy Logic (FL) [8], and Artificial 
Neural Networks [26]. The LQR controllers provide an optimal controller for 
linear time invariant (LTI) systems that minimizes a quadratic cost function 
and guarantees stability and robustness. Though the LQR design is not directly 
applicable to non-linear systems, often non-linear systems are approximated by 
linear systems via linearization around the equilibrium point, thus enabling the 
application of the LQR based design. Although the LQR controller provides 
good performance for LTI systems [28], studies have shown that the ANN con- 
trollers have better performance in the presence of uncertain environments [26]. 
The ANN controller is especially suitable for adaptive flight control applications, 
where system dynamics are dominated by unknown nonlinearities [19]. An
aircraft can experience a number of issues that may cause failures in the system.
Over-acceleration can cause the aircraft to gain too much energy and
enter unstable modes, while rapid deceleration and hard maneuvers
cause increased structural loading, leading to broken lifting platforms. Another
issue is stall, in which the airflow over the lifting section crosses a
“critical angle of attack”, compromising lift generation. All of these problems can
occur as a function of the control input or of external disturbances, such as high
wind gusts, further complicating the problem. Though ANN-based adaptive
controllers are capable of handling these situations, guaranteeing safe functionality
of these systems remains a challenge due to the complexity of these controllers.
So, on the one hand, we have LQR-based controllers that are efficient in nominal
conditions and simple enough to be amenable to analysis; on the other hand, we
have sophisticated ANN-based controllers that can handle difficult environmental
conditions but are too complex to be amenable to analysis.
Our solution is a “hybrid controller” consisting of a simplex-like architecture [7],
wherein we switch between the ANN and LQR controllers in such a way that
safety is guaranteed by the switching logic, that is, the aircraft is always
recoverable from a stall within a fixed amount of time if it occurs.

Our broad objective is to find an ANN-based controller that can improve
performance in uncertain environments. To achieve this goal, we need to train
the ANN controller; however, training an ANN controller during a real
flight test poses a safety risk. Hence, the solution we propose is to switch
between a traditional LQR controller and the ANN controller in such a way that
safety is guaranteed. More precisely, we allow the ANN controller to operate
while the aircraft remains within a “safe zone” from which the LQR controller
can guarantee that the aircraft never stalls. When the ANN controller is on the
verge of leaving the safe zone, we switch to the LQR controller. However, these
expert-determined safe zones are often too conservative (small), thereby not
providing sufficient time of operation for the ANN controller. A longer duration
of operation for the ANN controller is desirable for the learning process, so we
provide a method to extend the safe zone to a larger set (“recoverable zone”),
which guarantees that the aircraft recovers within a fixed amount of time if a stall
occurs. The recoverable zone computation is performed using formal-methods-based
reachable set computation, thereby providing a formally verified decision
procedure for the switching component that guarantees the safe operation of the
aircraft.

We consider a dynamic model of a fixed-wing aircraft with six degrees of
freedom (6-DOF), which is used as an experimental platform to employ a hybrid
controller that switches intelligently and automatically between an
LQR and an ANN-based controller. The aircraft dynamics consist of decoupled
longitudinal and lateral linear time-invariant dynamics, with a decoupled
state-feedback LQR controller for each component. For our simulations, we consider
an ANN controller that combines aircraft guidance and control systems and
performs end-to-end mapping from error states to control surface values, in order
to fly along a straight line with steady-state wings-level and altitude hold.

We have performed Hardware-in-the-Loop (HITL) simulation of the hybrid
controller in conjunction with the 6-DOF differential equations, on the
aircraft avionics using the open-source software QGroundControl. Our simulations
show that the number of sample iterations for which ANN controller actions
are performed while ensuring safe flying increases as the learning space
(recoverable zone) is expanded.


2 Related Work 


Artificial Neural Networks have been widely used in many control applications, 
such as automatic generation control of interconnected power systems [41], 
irrigation scheduling [37], micro-turbine power plant [36], solar binding [4], 
robotics [1,6], and aircraft control [17]. ANN is popularly used in flight con- 
trol [19], robot control [25] as well as for non-linear systems [42]. 

Verification has been extensively applied to dynamical systems, with a focus
on over-approximation based methods including predicate abstraction [8,22],
state-space exploration based fix-point computation [14], Hamilton-Jacobi based
methods [2], symbolic state space exploration based methods [16], Satisfiability
Modulo Theory (SMT) based methods [20,21,23,38], and counter-example
guided abstraction-refinement based methods [24,31,32].

Recent studies [40] compare several neural network verification algorithms.
Formal verification of feedforward neural networks with different activation
functions, such as ReLU [18] and Lipschitz-continuous functions [33], has been
studied. Different verification problems have been considered, including output
range analysis [10] and robustness analysis [15]. Verification methods include
those based on reduction to satisfiability solving [18], optimization solving [12],
abstract interpretation [35], abstraction-refinement [30], and linearization [13].
Verification of ANNs with feedback controllers has been explored [11].

In this paper, one of the problems we study is stall. A stall can occur
for many reasons, and researchers have developed different techniques to avoid
or recover from it. Deep stall, an uncontrollable state in which the angle of
attack (AOA) increases automatically and locks at a value far beyond the critical
angle of attack, has been studied [27]. Stall due to the wing has also been
studied [39], as have stall avoidance and recovery [9]. Here, we present a hybrid
controller consisting of an ANN and an LQR controller, similar to the simplex
design [7], which will not only recover, but also provide more learning space for
the ANN controller to explore. Our hybrid controller differs from the simplex
design [7] in several respects. Our hybrid controller makes the decision between
ANN and LQR control input via safety checking performed based on an
under-approximation of the reach set, which is computed off-line, whereas in [7]
the analysis is performed based on an over-approximation of the reach set. Also,
in [7], an initial set is known; in our work, a target set (“safe zone”) is known
and the initial set is unknown.
3 Hybrid Controller Architecture


In this section, we provide details of the hybrid controller architecture which is 
shown in Fig. 1. It has mainly four components: (a) Aircraft dynamics, (b) LQR 
controller (c) ANN controller, and (d) Switching logic. For the aircraft dynamics, 
we consider a 6-DOF model of the fixed wing aircraft. The hybrid controller 
consists of the LQR and the ANN controller, and the switching logic; the LQR 
and the ANN controller each receive the state of the aircraft periodically (which 
is obtained from the aircraft dynamics model in the simulations) and compute 
the inputs to the aircraft. The switching logic decides which input is fed back to 
the aircraft (dynamics) at each sample time, based on the current state of the 
system. The state of the system (dynamics) is updated according to the input 
selected. We note that the details of the ANN controller are not important for
the correctness of this work, since safety is guaranteed even when the ANN
controller is treated as a black box. For the ANN component of the hybrid
controller, we adapt the ANN controller from [34]. We briefly describe the
important aspects of the aircraft dynamics, the LQR controller, and the
switching logic.


[Figure 1: block diagram with the labels Way points, Lateral, Longitudinal, Controller, Control Input, Aircraft Dynamics, and Longitudinal states.]

Fig. 1. Hybrid controller architecture


Switch(x, τ, S):
    Input:  x - Current state,
            τ - Sample time interval,
            S - Safe zone
    Output: u - Control input

    u_ann = ANN(x)
    x' = nextstate(x, u_ann, τ)
    if x' ∈ S then
        return u_ann
    else
        u_lqr = LQR(x)
        return u_lqr
    end if

Fig. 2. Switching Logic for LQR and ANN controller


3.1 Aircraft Dynamics 


We start with a brief description of the aircraft states and motion. The aircraft
has 3 axes, the roll axis ($I$), pitch axis ($J$) and yaw axis ($K$), as shown in Fig. 3.
Motion occurs in two planes, the longitudinal, axes ($I$) and ($K$), and lateral,
axes ($I$) and ($J$), which are often considered to be decoupled.

In the longitudinal plane, the states are velocity ($V$), angle of attack ($\alpha$),
pitch angle ($\theta$) and pitch rate ($q$), and the control inputs are thrust ($\delta_T$) and elevator
deflection ($\delta_e$). All the states and control inputs are shown in Fig. 3. The angle
of attack ($\alpha$) is the angle between the roll axis ($I$) and the direction of velocity
($V$). The pitch angle ($\theta$) is the angle between the roll axis ($I$) and the horizontal
axis. The pitch rate ($q$) is the rate of change of the pitch angle $\theta$. When the
pitch angle ($\theta$) changes, the lateral plane rotates and the roll and yaw axes
change to $I_1$ and $K_1$, respectively. The thrust ($\delta_T$) generates a force that
moves the aircraft forward along the roll axis, and the elevator deflection ($\delta_e$)
acts on a control surface located at the rear of the aircraft which primarily controls
the pitch angle ($\theta$). The longitudinal dynamics is a linear dynamics of the form
$\dot{x}_{lon} = A_{lon} x_{lon} + B_{lon} u_{lon}$, where $x_{lon} = [V, \alpha, \theta, q]^T$, $u_{lon} = [\delta_T, \delta_e]^T$, and $A_{lon}$
and $B_{lon}$ are specific matrices.

Fig. 3. Overview of aircraft

In the lateral plane, the states are side-slip angle ($\beta$), roll angle ($\phi$), roll rate
($p$) and yaw rate ($r$), and the control inputs are aileron deflection ($\delta_a$) and rudder
deflection ($\delta_r$). The states and control inputs are shown in Fig. 3. The angle of
side-slip ($\beta$) is the angle between the roll axis ($I$) and the direction of incoming
airflow. When the roll axis $I$ rotates, the pitch axis ($J$) and the yaw axis ($K$)
change to $J_2$ and $K_2$, respectively. The roll angle ($\phi$) is the angle between $J$
and $J_2$. The roll rate ($p$) is the rate of change of the roll angle ($\phi$). The yaw rate
($r$) is the rate of rotation about the yaw axis ($K$). The aileron deflection
($\delta_a$) is the control surface input used to control rotation about the roll axis
($I$). The rudder deflection ($\delta_r$) is the control surface input used to control
rotation about the yaw axis ($K$). The lateral dynamics is a linear dynamics of the
form $\dot{x}_{lat} = A_{lat} x_{lat} + B_{lat} u_{lat}$, where $x_{lat} = [\beta, \phi, p, r]^T$, $u_{lat} = [\delta_a, \delta_r]^T$, and $A_{lat}$
and $B_{lat}$ are specific matrices.
3.2 LQR Controller


A Linear Quadratic Regulator (LQR) controller for a linear dynamics $\dot{x} = Ax + Bu$
is an optimal controller that minimizes a quadratic cost function ($J$). It is a linear
state feedback controller of the form $u = -Kx$, where $K$ is referred to as the gain
matrix. The closed loop dynamics is given by $\dot{x} = (A - BK)x$, which is the
system behavior when controlled by the LQR controller. Since the longitudinal
and lateral dynamics of the aircraft are decoupled, we have an LQR controller
for each component with gains $K_{lon}$ and $K_{lat}$, resulting in the corresponding closed
loop systems $\dot{x}_{lon} = (A_{lon} - B_{lon} K_{lon}) x_{lon}$ and $\dot{x}_{lat} = (A_{lat} - B_{lat} K_{lat}) x_{lat}$.
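
As a rough illustration of how an LQR gain arises, the sketch below solves the scalar case in closed form. The helper name `scalar_lqr_gain` and the numeric values are assumptions made for illustration only; they are not the aircraft model, and the matrix case would require a Riccati equation solver instead.

```python
import math


def scalar_lqr_gain(a: float, b: float, q: float, r: float) -> float:
    """Return k such that u = -k*x minimizes the integral of q*x**2 + r*u**2
    for the scalar dynamics dx/dt = a*x + b*u (hypothetical helper)."""
    if b == 0.0 or q <= 0.0 or r <= 0.0:
        raise ValueError("need b != 0, q > 0 and r > 0")
    # Positive root p of the scalar algebraic Riccati equation
    #   2*a*p - (b**2 / r) * p**2 + q = 0
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    return b * p / r


# Illustrative numbers only (not taken from the paper).
a, b, q, r = 1.0, 2.0, 3.0, 1.0
k = scalar_lqr_gain(a, b, q, r)
# Closed-loop coefficient a - b*k equals -sqrt(a**2 + b**2*q/r), hence is negative.
closed_loop_pole = a - b * k
```

Even though the open-loop system in this toy example is unstable (a > 0), the closed-loop coefficient is negative, which mirrors the role of $A - BK$ in the text.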


3.3 Switching Algorithm for the Safety of ANN Controller 


Stall is one of the important issues for any aircraft. Stall is a condition in which 
the angle of attack surpasses a critical bound and greatly decreases lift genera- 
tion. Consequently, the aircraft will start rapidly descending. Additional prob- 
lems occur when the aircraft encounters large accelerations, primarily about the 
roll and yaw axes, which can lead the aircraft into an unstable spiral mode, a 
dangerous and usually unrecoverable event. Finally, rapid maneuvers can lead to 
large loads on the aircraft structure, causing permanent deformation or breaking 
the structure altogether. Generally, exact constraints for these problems cannot 
be found due to the complexity of aircraft motion. However, a set of safe con- 
straints has been generated for the testbed aircraft by examining previous flight 
test data in which problems did not occur. 

The objective of the switching logic is to arbitrate the switching between the 
LQR and ANN based controllers, while maintaining safety and at the same time 
providing ANN controller the maximum opportunity to operate, and thereby 
learn. Our premise is that we have some known safe zone $S$, given by an expert,
in which LQR controller actions are safe, that is, if we apply the control input
$u = -Kx$ when $x \in S$ to the LTI dynamics of the aircraft, then the aircraft never
stalls. However, if we apply a control input $u'$ obtained by the ANN controller at
a state $x \in S$, we cannot ensure that the system never stalls. Computing such
a safe zone for an ANN controller would be computationally hard. Hence, the 
switching algorithm computes the effect of applying u’ computed by the ANN 
controller, and decides to pass it on to the system, if it infers that the system 
will be safe in the next step. Otherwise, it outputs the input suggested by the 
LQR controller. In either case, it ensures that the system is in the S region at all 
times during the operation of the flight. The details of the switching algorithm 
are provided in Fig. 2. 
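
The switching rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the controllers, the one-step predictor, and the zone test are passed in as plain callables, and the 1-D toy stand-ins below are hypothetical and unrelated to the actual aircraft model.

```python
from typing import Callable, List

State = List[float]
Control = List[float]


def switch(
    x: State,
    ann: Callable[[State], Control],
    lqr: Callable[[State], Control],
    next_state: Callable[[State, Control], State],
    in_safe_zone: Callable[[State], bool],
) -> Control:
    """Return the ANN input if the predicted next state stays in the zone,
    otherwise fall back to the LQR input."""
    u_ann = ann(x)
    x_next = next_state(x, u_ann)  # one-step prediction over the sample interval
    if in_safe_zone(x_next):
        return u_ann
    return lqr(x)


# Toy 1-D stand-ins (assumptions for illustration only).
TAU = 0.1


def toy_ann(x: State) -> Control:
    return [5.0]  # aggressive constant command


def toy_lqr(x: State) -> Control:
    return [-2.0 * x[0]]  # stabilizing state feedback


def toy_next(x: State, u: Control) -> State:
    return [x[0] + TAU * u[0]]  # forward-Euler step


def toy_safe(x: State) -> bool:
    return abs(x[0]) <= 1.0  # zone S = [-1, 1]


u_inside = switch([0.4], toy_ann, toy_lqr, toy_next, toy_safe)
u_near_edge = switch([0.8], toy_ann, toy_lqr, toy_next, toy_safe)
```

In the toy run, the ANN command is kept at x = 0.4 because the predicted state 0.9 stays inside the zone, while at x = 0.8 the predicted state 1.3 leaves it and the LQR command is returned instead.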

The performance of the hybrid controller depends on the safe zone. The safe 
zone obtained by expert advice is often conservative. Hence, we provide a method 
to extend the safe zone to a larger zone (the “recoverable zone”) for which the
switching algorithm guarantees that the system is always recoverable within a
fixed duration if a stall occurs. Next, we provide the details of computing the recoverable zone.




4 Computation of Recoverable Zone 


In this section, we provide the details of computing a recoverable zone for a
fixed time $T > 0$. Our broad goal is to compute all those states from which the
given safe zone $S$ can be reached within the time $T$ for the LTI dynamics
of the aircraft, which is of the form $\dot{x} = (A - BK)x$, where $K$ is an LQR control
gain matrix. This is the problem of computing the backward reach set of a linear
system


$\dot{x} = Cx$    (1)


where $C = A - BK$. The solution of the linear system $\dot{x} = Cx$ is given by
$x(t) = e^{Ct} x(0)$, where $x(t)$ is the state of the system at time $t$. Hence, we define
the backward reach set for a given linear closed loop system as follows:


Definition 1. [Backward Reach Set] Given a linear closed loop system $\dot{x} = Ax$,
a time horizon $T > 0$, and a final set of states $X_f$, the backward reach set
$\mathrm{Reach}_B(X_f, A, [0, T])$ is defined as follows:


$\mathrm{Reach}_B(X_f, A, [0, T]) = \{x \mid \exists\, t \in [0, T],\ e^{At} x \in X_f\}$.


Next, we formally define the recoverable zone in terms of backward reach set. 


Definition 2. [Recoverable Zone] Given the system in Eq. (1), a time horizon
$T > 0$, and a safe zone $S$, a recoverable zone $S'$ is defined as follows:
$S' = \mathrm{Reach}_B(S, C, [0, T])$.


The computation of the recoverable zone $S'$ can alternatively be tackled using a
forward reachability analysis on the following transformed equation.


$\dot{x} = -Cx$    (2)

We define the forward reach set for a given linear closed loop system as follows:


Definition 3. [Forward Reach Set] Given a linear closed loop system $\dot{x} = Ax$, a
time horizon $T > 0$, and an initial set of states $X_0$, the forward reach set
$\mathrm{Reach}_F(X_0, A, [0, T])$ is defined as follows:


$\mathrm{Reach}_F(X_0, A, [0, T]) = \{e^{At} x \mid \exists\, t \in [0, T],\ \exists\, x \in X_0\}$.


Equation (2) is obtained from Eq. (1) by negating the right hand side. The
effect of the transformation is that the system now evolves backward in time.
We notice that the set of states that can reach $S$ within time $T$ under Equation
(1), $\mathrm{Reach}_B(S, C, [0, T])$, is equal to the set of states reached under Equation (2)
from $S$ within the time horizon $T > 0$, $\mathrm{Reach}_F(S, -C, [0, T])$. Next, we formulate
this equivalence of forward and backward reach sets of the two systems, namely
Equations (1) and (2), in Theorem 1.
Theorem 1. Given the systems in Equation (1) and Equation (2), a time horizon
$T > 0$, and a safe zone $S$, we have $\mathrm{Reach}_F(S, -C, [0, T]) = \mathrm{Reach}_B(S, C, [0, T])$.
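
The following short derivation is offered only as a sketch of why the equality is plausible; it is not quoted from the paper. It relies on the standard fact that the matrix exponential is invertible with $(e^{Ct})^{-1} = e^{-Ct}$, and the auxiliary point $y$ is introduced here for illustration.

```latex
\begin{align*}
x \in \mathrm{Reach}_B(S, C, [0,T])
  &\iff \exists\, t \in [0,T] :\ e^{Ct} x \in S \\
  &\iff \exists\, t \in [0,T],\ \exists\, y \in S :\ e^{Ct} x = y \\
  &\iff \exists\, t \in [0,T],\ \exists\, y \in S :\ x = e^{-Ct} y \\
  &\iff x \in \mathrm{Reach}_F(S, -C, [0,T]).
\end{align*}
```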


The computation of the exact recoverable zone is complex because the solution
of Equation (2) involves the matrix exponential, and there are no known
algorithms for solving constraints with exponential functions, unlike solvers for
linear and polynomial functions. Hence, several over-approximation methods
have been investigated [5,16,20,29,31,32]. An over-approximated recoverable
zone violates the property of the recoverable zone, that is, it may contain points
that are not guaranteed to reach the safe zone within the time bound. In this
situation, the stall may not be recoverable if it occurs. Therefore, we compute
an under-approximation of the exact recoverable zone $S'$, which is conservative
but nevertheless ensures the safety of the switching algorithm.


4.1 Under-Approximation of Recoverable Zone 


In this section, we provide a method to compute an under-approximation of the
exact recoverable zone $S'$. While computing under-approximations is in general
hard, we use a simple idea that provides a practically viable under-approximation
for our purposes. Our broad approach is based on sampling, and consists of an
under-approximate reach set which is the union of the reach sets at certain time
points, as opposed to all the points in the given interval. We sample the time
interval $[0, T]$ at sample times that are multiples of $\tau$. Then, we compute the
reach set from the safe zone $S$ under Equation (2) at sample times $\tau, 2\tau, \ldots, k\tau = T$
and take their union, that is, the under-approximation of the recoverable zone,
denoted $\mathrm{Approx}(S)$, is


$\mathrm{Approx}(S) = \bigcup_{i=1}^{k} \mathrm{Reach}_F(S, -C, i\tau)$,


where $\mathrm{Reach}_F(S, -C, i\tau)$ denotes the forward reach set from $S$ at time $i\tau$. Next,
we show that $\mathrm{Approx}(S)$ is an under-approximation of the recoverable zone $S'$.
We formulate this in Theorem 2.


Theorem 2. Given the system in Equation (2), a time horizon $T > 0$, and a safe zone
$S$, we have $\mathrm{Approx}(S) \subseteq \mathrm{Reach}_F(S, -C, [0, T])$.


Note that $\mathrm{Approx}(S)$ converges to the exact recoverable zone $S'$ as $\tau \to 0$.
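
To make the sampling idea concrete, the sketch below tests whether a single point belongs to the sampled zone. It uses the observation that a point lies in the reach set of the time-reversed system at a sample time exactly when propagating it forward under $\dot{x} = Cx$ for that time lands in $S$. Everything here is an illustrative assumption: the safe zone is taken to be an axis-aligned box, the helper names are hypothetical, the matrix exponential is a truncated Taylor series suited only to small matrices of modest norm, and the numbers are not from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Box = Sequence[Tuple[float, float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m: Matrix, terms: int = 30) -> Matrix:
    """Truncated Taylor series of e^m (adequate only for small, modest-norm m)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for order in range(1, terms + 1):
        term = [[v / order for v in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def in_sampled_zone(x: Sequence[float], c: Matrix, safe_box: Box,
                    horizon: float, step: float) -> bool:
    """True if e^{C*i*step} x lies in safe_box for some i = 1..k, k*step <= horizon.

    Equivalent to x belonging to the union of e^{-C*i*step} S over the samples."""
    step_map = mat_exp([[v * step for v in row] for row in c])  # e^{C*step}
    y = list(x)
    for _ in range(int(horizon / step + 1e-9)):
        y = [sum(step_map[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]
        if all(lo <= yi <= hi for yi, (lo, hi) in zip(y, safe_box)):
            return True
    return False


# Toy stable closed-loop matrix and box-shaped safe zone (illustrative only).
C = [[-1.0, 0.0], [0.0, -2.0]]
S = [(-1.0, 1.0), (-1.0, 1.0)]
recoverable = in_sampled_zone([2.1, 0.0], C, S, horizon=1.0, step=0.25)
too_far = in_sampled_zone([3.0, 0.0], C, S, horizon=1.0, step=0.25)
```

In this toy setting the point (2.1, 0) enters the box at the third sample, while (3, 0) cannot reach it within the horizon, so only the former is reported as recoverable.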


5 Experimental Analysis 


In this section, we provide the details of our implementation of the hybrid controller
architecture. Then, we present the experimental results.


5.1 Experimental Setup 


The experimentation method for preliminary concept testing is a Hardware-in-the-Loop
(HITL) simulation. The HITL simulation runs the 6-DOF differential equations
on the aircraft avionics, propagating them with a fourth-order Runge-Kutta
integration method.
Fig. 4. AFS 6.0    Fig. 5. HITL aggressive trajectory


This technique generates all aircraft states and control inputs that are nec- 
essary to the operation of the switch. The main advantage of conducting these 
simulations as HITL rather than pure software simulations is that all the code
is tested on the actual hardware used for flight, exposing any shortcomings
in computation power or integration missteps that may impact flight test
success.

The current avionics, Autopilot Flight System (AFS) 6.0, consists of three 
main components. Sensor data and outputs are handled by the Pixhawk 2.1 cube. 
The onboard computer which runs the in-house designed guidance, navigation 
and control (GNC) algorithms, as well as handles the state emulation is the 
Nvidia Tegra Nano. The Tegra Nano is a low cost system, with a quad-core 
CPU and a 128 core GPU. The final component is a 900 MHz telemetry unit 
which serves as the communication between the aircraft and the ground station, 
where the ground station provides a visual representation of the current aircraft 
state as well as relevant GNC information. The ground station used for these 
simulations is a modified version of the open source software, QGroundControl, 
which is also used to generate way-points for the given area of operation. Figure 4 
shows both the front and back sides of the custom avionics boards. 

While in HITL, the ANN controllers are very stable due to being trained 
with similar dynamic models to those that are used to propagate the simulation. 
This makes it unlikely to see the switching logic in action as no control inputs 
would be deemed unsafe, especially in grid or racetrack patterns that make 
up the majority of flight test operations. To circumvent this, an oddly shaped 
trajectory, shown in Fig. 5, with multiple sharp turns is used to ensure previously 
un-visited states are achieved. The simulation is run for approximately one lap 
of the given trajectory for each value of the time horizon shown in the following 
section. 


5.2 Experimental Results 


In this section, we present the simulation results for the performance of the hybrid
controller. For the simulation, we consider the safe zone provided by experts,
which is given in Table 1. We run the simulation for different recoverable zones,
which are computed for different values of the time horizon $T$, namely, $T = 0.05$,
$T = 0.15$, $T = 0.25$, and $T = 0.35$, with time step $\tau = 0.05$ unit. The simulation
results are plotted in Fig. 6 and Fig. 7 for longitudinal velocity and lateral angle of
side-slip, respectively.
Fig. 6. Switching between ANN and LQR controller for the longitudinal velocity


Table 1. Safe zone for longitudinal and lateral state variables 


Safe Zone | V (Feet/sec.) | $\alpha$ (Radians) | $\theta$ (Radians) | Q (Radians/sec.) | $\beta$ (Radians) | $\phi$ (Radians) | P (Radians/sec.) | R (Radians/sec.)
Min | -15 | -0.087 | -0.262 | -0.262 | -0.122 | -0.785 | -0.873 | -0.349
Max | 15 | 0.087 | 0.262 | 0.262 | 0.122 | 0.785 | 0.873 | 0.349




In both Figs.6 and 7, we observe that the recoverable zone expands when 
the time horizon T increases. 


[Figure 7 panels: Switching for T=0.05, Switching for T=0.15, Switching for T=0.25, Switching for T=0.35; x-axis: Sample Iterations (0 to 3000); y-axis: $\beta$ (Radians).]

Fig. 7. Switching between ANN and LQR controller for the lateral angle of side-slip 


Also, we observe that the number of sample iterations in which ANN controller
actions are performed increases when the recoverable zone is expanded.
For instance, in Figs. 6 and 7, for $T = 0.35$, ANN controller actions are
performed from sample iteration 1500 to 3000, which is not the case for
$T = 0.25$. For clarity, Table 2 presents the number of sample iterations $N$
in which ANN and LQR controller actions are performed,
for different values of the time horizon $T$.


Table 2. Number of sample iterations for ANN and LQR controller 


Controller | N for T = 0.05 | N for T = 0.15 | N for T = 0.25 | N for T = 0.35
ANN | 3181 | 3269 | 3295 | 3345
LQR | 220 | 132 | 106 | 56
Total | 3401 | 3401 | 3401 | 3401


In Table 2, we observe that $N$ grows for the ANN controller when the time horizon
$T$ increases, that is, when the recoverable zone is expanded, while $N$ decreases for
the LQR controller. This supports the claim that the hybrid controller
framework gives the ANN controller ample time to operate and learn while ensuring
a safe flight.
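
For a quick sense of scale, the counts in Table 2 can be turned into the fraction of sample iterations handled by the ANN controller. The snippet below is only arithmetic on the tabulated values; the variable names are illustrative.

```python
# Counts copied from Table 2, keyed by time horizon T.
ann_counts = {0.05: 3181, 0.15: 3269, 0.25: 3295, 0.35: 3345}
lqr_counts = {0.05: 220, 0.15: 132, 0.25: 106, 0.35: 56}

# Fraction of iterations in which the ANN input was applied.
ann_share = {
    horizon: ann_counts[horizon] / (ann_counts[horizon] + lqr_counts[horizon])
    for horizon in ann_counts
}

for horizon in sorted(ann_share):
    print(f"T = {horizon:.2f}: ANN share = {ann_share[horizon]:.1%}")
```

The share rises from roughly 93.5% at T = 0.05 to roughly 98.4% at T = 0.35.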


5.3 Practical Challenges 


The implementation of the hybrid controller proved to be complex in two ways. 
First, the timing of the switching logic was important to the overall safety of the 
project. When delays are introduced into the system, the current state of the 
aircraft and the information the switch is making the decision on can become out 
of sync. If the switching logic is behind the aircraft states it can make incorrect 
calls on whether or not the aircraft is still safe, and cause the ANN to overextend 
its operation, leading to a loss of control. This is made worse as aircraft have 
large inertias and relatively slow time constants on control inputs meaning they 
can become uncontrollable much quicker than most dynamic systems. This need 
for extreme low latency operation caused many changes in the code structure 
including a rewrite from Python to C++ and parallelization of applicable code. 
The second practical problem is the lack of full state feedback and the low-quality
sensor data. Two of the aircraft states, angle of attack and sideslip angle,
cannot be directly measured by low cost systems. The easiest solution is to 
employ a Kalman filtering technique to estimate these two states. However, if 
the aircraft is experiencing a large perturbation away from the trim point, the 
Kalman Filter can diverge very rapidly and feed incorrect information to the 
switch about the relevant states. On top of this, many of the measured states 
are taken using low-cost, off the shelf components. In a similar way, the use of 
these components may introduce noise or a bias which could allow the aircraft 
to go into the uncontrollable region without alerting the switch or the aircraft 
operator. Low pass filtering is applied to attempt to deal with the noise, but the 
imparted delay to the sensor data must also be taken into consideration. 
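
The trade-off between noise suppression and delay can be illustrated with a first-order low-pass filter applied to a unit step. The text does not say which filter was used on the avionics, so the exponential-smoothing form, the helper name, and the smoothing factor below are assumptions chosen purely for illustration.

```python
from typing import List


def low_pass(samples: List[float], alpha: float, initial: float = 0.0) -> List[float]:
    """First-order exponential smoothing: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out = []
    y = initial
    for x in samples:
        y = y + alpha * (x - y)
        out.append(y)
    return out


# A unit step in the measured signal, filtered with an assumed alpha of 0.2.
step_input = [1.0] * 30
filtered = low_pass(step_input, alpha=0.2)

# Number of samples needed before the filtered signal reaches 90% of the step.
lag_samples = next(i + 1 for i, y in enumerate(filtered) if y >= 0.9)
```

With this assumed smoothing factor the filtered value needs 11 samples to reach 90% of the step, which is the kind of delay the switch would have to account for.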
6 Conclusions


We have developed a hybrid controller for aircraft dynamics which provides
a considerable amount of time for the ANN controller to operate and learn, while at
the same time guaranteeing the safe operation of the flight at all times. In the future,
we will consider more sophisticated ANN controllers and investigate methods for
computing larger recoverable zones that allow for a further increase of the ANN
operation time. Additionally, experimentation will be done with real flight tests,
moving past HITL simulations.
Abstract. We present SceneChecker, a tool for verifying scenarios involving
vehicles executing complex plans in large cluttered workspaces. SceneChecker 
converts the scenario verification problem to a standard hybrid system verifica- 
tion problem, and solves it effectively by exploiting structural properties in the 
plan and the vehicle dynamics. SceneChecker uses symmetry abstractions, a 
novel refinement algorithm, and importantly, is built to boost the performance 
of any existing reachability analysis tool as a plug-in subroutine. We evaluated 
SceneChecker on several scenarios involving ground and aerial vehicles with 
nonlinear dynamics and neural network controllers, employing different kinds of 
symmetries, using different reachability subroutines, and following plans with 
hundreds of waypoints in complex workspaces. Compared to two leading tools, 
DryVR and Flow*, SceneChecker shows 14x average speedup in verification 
time, even while using those very tools as reachability subroutines. 


Keywords: Hybrid systems - Safety verification - Symmetry 


1 Introduction 


Remarkable progress has been made in safety verification of hybrid and cyber- 
physical systems in the last decade [2-9]. The methods and tools developed have been 
applied to check safety of aerospace, medical, and autonomous vehicle control sys- 
tems [4,5,10,11]. The next barrier in making these techniques usable for more com- 
plex applications is to deal with what is colloquially called the scenario verification 
problem. A key part of the scenario verification problem is to check that a vehicle or an 
agent can execute a plan through a complex environment. A planning algorithm (e.g., 
probabilistic roadmaps [12] and rapidly-exploring random trees (RRTs) [13]) generates 
a set of possible paths avoiding obstacles, but only considering the geometry of the 
scenario, not the dynamics. The verification task has to ensure that the plan can indeed 
be safely executed by the vehicle with all the dynamic constraints and the state
estimation uncertainties. Indeed, one can view a scenario as a hybrid automaton with the
modes defined by the segments of the planner, but this leads to massive models. Encod- 
ing such automata in existing tools presents some practical hurdles. More importantly, 
analyzing such models is challenging as the over-approximation errors and the analysis 
times grow rapidly with the number of transitions. At the same time, such large hybrid 
verification problems also have lots of repetitions and symmetries, which suggest new 
opportunities. 

We present SceneChecker, a tool that implements a symmetry abstraction- 
refinement algorithm for efficient scenario verification. Symmetry abstractions signif- 
icantly reduce the number of modes and edges of an automaton H by grouping all 
modes that share symmetric continuous dynamics [14]. SceneChecker implements a 
novel refinement algorithm for symmetry abstractions and is able to use any exist- 
ing reachability analysis tool as a subroutine. Our current implementation comes with 
plug-ins for using Flow* [4] and DryVR [6]. SceneChecker’s verification algorithm 
is sound, i.e., if it returns safe, then the reachset of H indeed does not intersect the 
unsafe set. The algorithm is lossless in the sense that if one can prove safety without 
using abstraction, then SceneChecker can also prove safety via abstraction-refinement, 
and typically a lot faster. SceneChecker can be found on figshare: https://figshare.com/ 
articles/software/CAV2021_reduce_v6_ova/14504352 and its website: https://publish. 
illinois.edu/scenechecker/. An extended version of this paper is available online [1]. 

SceneChecker offers an easy interface to specify plans, agent dynamics, obstacles, 
initial uncertainty, and symmetry maps. SceneChecker checks if a fixed point has been 
reached after each call to the reachability subroutine, avoiding repeating computations. 
First, SceneChecker represents the input scenario as a hybrid automaton H where modes 
are defined by the plan’s segments. It uses the symmetry maps provided by the user to 
construct an abstract automaton $H_v$. Automaton $H_v$ represents another scenario with
fewer segments, each representing an equivalence class of symmetric segments in H.
A side effect of the abstraction is that upon reaching waypoints in $H_v$, the agent’s state
resets non-deterministically to a set of possible states. For example, in the case of rotation
and translation invariance, the abstract scenario would have a single segment for
any set of segments with a unique length in the original scenario. SceneChecker refines
$H_v$ by splitting one of its modes into two modes. That corresponds to representing a set
of symmetric segments with one more segment in the abstract scenario, capturing more
accurately the original scenario¹.

We evaluated SceneChecker on several scenarios where car and quadrotor agents 
with nonlinear dynamics follow plans to reach several destinations in 2D and 3D 
workspaces with hundreds of waypoints and polytopic obstacles. We considered dif- 
ferent symmetries (translation and rotation invariance) and controllers (Proportional- 
Derivative (PD) and Neural Networks (NN)). We compared the verification time of 
SceneChecker with DryVR and Flow* as reachability subroutines against Flow* and 
DryVR as standalone tools. SceneChecker is faster than both tools in all scenarios con- 
sidered, achieving an average of 14x speedup in verification time (Table 1). In certain 
scenarios where Flow* timed out (executing for more than 120 min), SceneChecker 


1 A figure showing the architecture of SceneChecker can be found in the extended version [1].
is able to complete verification in as little as 12 min using Flow* as a subroutine.
SceneChecker when using abstraction-refinement achieved 13x speedup in verifica- 
tion time over not using abstraction-refinement in scenarios with the NN-controlled 
quadrotor (Sect. 7). 


Related Work. The idea of using symmetries to accelerate verification has been 
exploited in a number of contexts such as probabilistic models [15,16], automata 
[17,18], distributed architectures [19], and hardware [20,21]. Some symmetry utilization
algorithms are implemented in Murφ [22] and Uppaal [23].

In our context of cyber-physical systems, Bak et al. [24] suggested using symme- 
try maps, called reachability reduction transformations, to transform reachsets to sym- 
metric reachsets for continuous dynamical systems modeling non-interacting vehicles. 
Maidens et al. [25] proposed a symmetry-based dimensionality reduction method for 
backward reachable set computations for discrete dynamical systems. Majumdar et al. 
[26] proposed a safe motion planning algorithm that computes a family of reachsets 
offline and composes them online using symmetry. Bujorianu et al. [27] presented a 
symmetry-based theory to reduce stochastic hybrid systems for faster reachability anal- 
ysis and discussed the challenges of designing symmetry reduction techniques across 
mode transitions. 

In more closely related research, we presented a modified version of DryVR
that utilizes symmetry to cache reachsets aiming to accelerate simulation-based safety 
verification of continuous dynamical systems [28]. We developed the related tool 
CacheReach that implements a hybrid system verification algorithm that uses sym- 
metry to accelerate reachability analysis [29]. CacheReach caches and shares com- 
puted reachsets between different modes of non-interacting agents using symmetry. 
SceneChecker is based on the theory of symmetry abstractions of hybrid automata 
we presented in [14]. We suggested computing the reachset of the abstract automaton
instead of the concrete one and then transforming it to the concrete reachset using symmetry
maps to accelerate verification. SceneChecker is built based on this line of work with
significant algorithmic and engineering improvements. In addition to the abstraction of 
[14], SceneChecker 1) maps the unsafe set to an abstract unsafe set and verifies the 
abstract automaton instead of the concrete one and 2) decreases the over-approximation 
error of the abstraction through refinement. SceneChecker does not cache reachsets 
and thus saves cache-access and reachset-transformation times and does not incur over- 
approximation errors due to caching that CacheReach suffers from [29]. At the imple- 
mentation level, SceneChecker accepts plans that are general directed graphs and poly- 
topic unsafe sets while CacheReach accepts only single-path plans and hyperrectan- 
gle unsafe sets. We show more than 30x speedup in verification time while having 
more accurate verification results when comparing SceneChecker against CacheReach 
(Table 1 in Sect. 7).


2 Specifying Scenarios in SceneChecker 


A scenario verification problem is specified by a set of fixed obstacles, a plan, and an 
agent that is supposed to execute the plan without running into the obstacles (e.g., see 
Fig. 1B). For ground and air vehicles, for example, the agent moves in a subset of the 2D
or the 3D Euclidean space called the workspace. A plan is a directed graph $G = (V, S)$
with vertices $V$ in the workspace called waypoints and edges $S$ called segments².
A general graph allows for nondeterministic and contingency planning.

An agent is a control system that can follow waypoints. Let the state space of the
agent be $X$ and $\Theta \subseteq X$ be the uncertain initial set. Let $s_{init}$ be the initial segment in $G$
that the agent has to follow. From any state $x \in X$, the agent follows a segment $s \in S$
by moving along a trajectory. A trajectory is a function $\xi : X \times S \times \mathbb{R}_{\geq 0} \to X$ that
meets certain dynamical constraints of the vehicle. Dynamics are either specified by
ordinary differential equations (ODE) or by a black-box simulator. For ODE models, $\xi$
is a solution of an equation of the form $\frac{d\xi}{dt}(x, s, t) = f(\xi(x, s, t), s)$, for any $t \in \mathbb{R}_{\geq 0}$ and
$\xi(x, s, 0) = x$, where $f : X \times S \to X$ is Lipschitz continuous in the first argument. Note
that the trajectories only depend on the segment the agent is following (and not on the
full plan $G$). We denote by $\xi$.fstate, $\xi$.lstate, and $\xi$.dom the initial and last states and
the time domain of the time bounded trajectory $\xi$, respectively.

We can view the obstacles near each segment as sets of unsafe states, $O : S \to 2^X$.
The map $tbound : S \to \mathbb{R}_{\geq 0}$ determines the maximum time the agent should spend
in following any segment. For any pair of consecutive segments $(s, s')$, i.e. sharing a
common waypoint in $G$, $guard((s, s'))$ defines the set of states (a hyperrectangle around
a waypoint) at which the agent is allowed to transition from following $s$ to following $s'$.


Scenario JSON file is the first of the two user inputs. It specifies the scenario: $\Theta$
as a hyperrectangle; $S$ as a list of lists each representing two waypoints; guard
as a list of hyperrectangles; tbound as a list of floats; and $O$ as a list of polytopes.


Output of SceneChecker is the scenario verification result (safe or unknown) 
and a number of useful performance metrics, such as the number of mode- 
splits, number of reachability calls, reachsets computation time, and total time. 
SceneChecker can also visualize the various computed reachsets. 


3 Transforming Scenarios to Hybrid Automata 


The input scenario is first represented as a hybrid automaton by a Hybrid constructor. 
This constructor is a Python function that parses the Scenario file and constructs the 
data structures to store the scenario’s hybrid automaton components. In what follows, 
we describe the constructed automaton informally. In our current implementation, sets 
are represented either as hyper-rectangles or as polytopes using the Tulip Polytope
Library³.


2 We introduce this redundant nomenclature because later we will reserve the term edges to talk 
about mode transitions in hybrid automata. We use waypoints instead of vertices as a more 
natural term for points that vehicles have to follow. 

3 https://pypi.org/project/polytope/. 
Scenario as a Hybrid Automaton. A hybrid automaton has a set of modes (or discrete
states) and a set of continuous states. The evolution of the continuous states in each
mode is specified by a set of trajectories and the transitions across the modes are specified
by guard and reset maps. The agent following a plan in a workspace can be naturally
modeled as a hybrid automaton H, where $s_{init}$ and $\Theta$ are its initial mode and set of
states.

Each segment $s \in S$ of the plan G defines a mode of H (e.g. see Fig. 1A). The set
of edges $E \subseteq S \times S$ of H is defined as pairs of consecutive segments in G. For an edge
$e \in E$, guard(e) is the same as that of G. The reset map of H is the identity map. We
will see in Sect. 5 that abstract automata will have nontrivial reset maps.


Verification Problem. An execution of length k is a sequence $\sigma := (\xi_0, s_0), \ldots, (\xi_k, s_k)$.
It models the behavior of the agent following a particular path in the plan G. An execution
$\sigma$ must satisfy: 1) $\xi_0$.fstate $\in \Theta$ and $s_0 = s_{init}$; for each $i \in \{0, \ldots, k-1\}$,
2) $(s_i, s_{i+1}) \in E$, 3) $\xi_i$.lstate $\in guard((s_i, s_{i+1}))$, and 4) $\xi_i$.lstate $= \xi_{i+1}$.fstate; and
5) for each $i \in \{0, \ldots, k\}$, $\xi_i$.dom $\leq tbound(s_i)$. The set of reachable states is
$Reach_H := \{\sigma$.lstate $\mid \sigma$ is an execution$\}$. The restriction of $Reach_H$ to states with mode $s \in S$
(i.e., agent following segment s) is denoted by $Reach_H(s)$. Thus, the hybrid system
verification problem requires us to check whether $\forall s \in S$, $Reach_H(s) \cap O(s) = \emptyset$.


4 Specifying Symmetry Maps in SceneChecker 


The hybrid automaton representing a scenario, as constructed by the Hybrid
constructor, is transformed into an abstract automaton. SceneChecker uses symmetry
abstractions [14]. The abstraction is constructed by the abstract function (line 1 of
Algorithm 1) which uses a collection of pairs of maps
$\Phi = \{(\gamma_s : X \to X, \rho_s : S \to S)\}_{s \in S}$ that is provided by the user. We describe
below how these maps are specified by the user in the Dynamics file. These maps should satisfy:


$\forall t \geq 0, x_0 \in X, s \in S, \quad \gamma_s(\xi(x_0, s, t)) = \xi(\gamma_s(x_0), \rho_s(s), t)$,     (1)


where $\forall s \in S$, the map $\gamma_s$ is differentiable and invertible. Such maps are called symmetries
for the agent’s dynamics. They transform the agent’s trajectories to other symmetric
ones of its trajectories starting from symmetric initial states and following symmetric
modes (or segments in our scenario verification setting). It is worth noting that (1) does
not depend on whether the trajectories $\xi$ are defined by ODEs or black-box simulators.
Currently, condition (1) is not checked by SceneChecker for the maps specified by
the user. However, in the following discussion, we present some ways for the user to
check (1) on their own. For ODE models, a sufficient condition for (1) to be satisfied
is: $\forall x \in X, s \in S$, $\frac{\partial \gamma_s}{\partial x} f(x, s) = f(\gamma_s(x), \rho_s(s))$, where f is the right-hand side of the
ODE [30]. For black-box models, (1) can be checked using sampling methods. In realistic
settings, dynamics might not be exactly symmetric due to unmodeled uncertainties.
In the future, we plan to account for such uncertainties as part of the reachability analysis.
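
A sampling check of condition (1) could be sketched as follows. Everything in this block is an illustrative assumption rather than part of SceneChecker: the toy black-box simulate (a point moving at unit speed toward the end waypoint of its segment), the helper names gamma, rho, and check_symmetry, and the tolerance. The maps implement translation plus rotation so that the end waypoint becomes the origin and the segment lies on the negative horizontal axis.

```python
import math
import random


def simulate(x0, seg, t):
    """Toy black-box trajectory: move at unit speed toward the end waypoint of seg."""
    wx, wy = seg[1]
    dx, dy = wx - x0[0], wy - x0[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        return x0
    c = min(t, d) / d
    return (x0[0] + c * dx, x0[1] + c * dy)


def gamma(seg, x):
    """State map: translate the end waypoint to the origin, then rotate by -theta."""
    (ax, ay), (bx, by) = seg
    theta = math.atan2(by - ay, bx - ax)
    px, py = x[0] - bx, x[1] - by
    cos_t, sin_t = math.cos(-theta), math.sin(-theta)
    return (cos_t * px - sin_t * py, sin_t * px + cos_t * py)


def rho(seg):
    """Mode map: the same segment expressed in its own coordinate system."""
    return (gamma(seg, seg[0]), (0.0, 0.0))


def check_symmetry(sim, seg, samples=1000, tol=1e-9, seed=0):
    """Return False as soon as one sampled (x0, t) violates condition (1)."""
    rng = random.Random(seed)
    for _ in range(samples):
        x0 = (rng.uniform(-10, 10), rng.uniform(-10, 10))
        t = rng.uniform(0, 20)
        lhs = gamma(seg, sim(x0, seg, t))
        rhs = sim(gamma(seg, x0), rho(seg), t)
        if math.dist(lhs, rhs) > tol:
            return False
    return True


def windy(x0, seg, t):
    """Same toy model plus a constant drift along +x, which breaks rotation symmetry."""
    x, y = simulate(x0, seg, t)
    return (x + 0.1 * t, y)


segment = ((0.0, 0.0), (3.0, 4.0))
symmetric_ok = check_symmetry(simulate, segment)
windy_ok = check_symmetry(windy, segment)
```

Passing such a test is only evidence, not a proof, that (1) holds; a failing sample, on the other hand, is a definite counterexample to the claimed symmetry.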

In scenario verification, a given workspace would have a coordinate system according
to which the plan (waypoints) and the agent’s state (position, velocity, heading
angle, etc.) are represented. In a 2D workspace, for any segment $s \in S$, an example
symmetry $\rho_s$ would transform the two waypoints of s to a new coordinate system where
the second waypoint is the origin and s is aligned with the negative side of the horizontal
axis (see Fig. 1D). The corresponding $\gamma_s$ would transform the agent’s state to
this new coordinate system (e.g. by rotating its position and velocity vectors and shifting
the heading angle). For such a pair $(\gamma_s, \rho_s)$ to satisfy (1), the agent’s dynamics
have to be invariant to such a coordinate transformation and (1) merely formalizes this
requirement. Such an invariance property is expected from vehicles’ dynamics: rotating
or translating the lane should not change how an autonomous car behaves.


Dynamics file is the second input provided by the user in addition to the
Scenario file and it contains the following:

polyVir(X’,s): returns $\gamma_s(X')$ for any polytope $X' \subseteq X$ and segment $s \in S$.

modeVir(s): returns $\rho_s(s)$ for any given segment $s \in S$.

virPoly(X’,s): returns $\gamma_s^{-1}(X')$, implementing the inverse of polyVir.

computeReachset(initset,s,T): returns a list of hyperrectangles over-approximating
the agent’s reachset starting from initset following segment s
for T time units, for any set of states initset $\subseteq X$, segment $s \in S$, and $T \geq 0$.


5 Symmetry Abstraction of the Scenario’s Automaton 


In this section, we describe how the abstract function in Algorithm 1 uses the functions
in the Dynamics file to construct an abstraction of the scenario’s hybrid automaton
provided by the Hybrid constructor. Given the symmetry maps of $\Phi$, the symmetry
abstraction of H is another hybrid automaton $H_v$ that aggregates many symmetric modes
(segments) of H into a single mode of $H_v$.


Modes and Transitions. Any segment $s \in S$ of H is mapped to the segment $\rho_s(s)$ in $H_v$
using modeVir. The set of modes $S_v$ of $H_v$ is the set of segments $\{\rho_s(s)\}_{s \in S}$. For any
$s_v \in S_v$, $tbound_v(s_v) = \max_{s \in S, \rho_s(s) = s_v} tbound(s)$. In the example of Sect. 4 (Fig. 1D), the
segments in $H_v$ are aligned with the horizontal axis and end at the origin. The number of
segments in $H_v$ would be the number of segments in G with unique lengths. The agent
would always be moving towards the origin of the workspace in the abstract scenario.
Any edge $e = (s, s') \in E$ of H is mapped to the edge $e_v = (\rho_s(s), \rho_{s'}(s'))$ in $H_v$. The
guard(e) is mapped to $\gamma_s(guard(e))$ using polyVir, which becomes part of $guard_v(e_v)$ in
$H_v$. For any $x \in X$, reset(x, e), which is equal to x, is mapped to $\gamma_{s'}(\gamma_s^{-1}(x))$ and becomes
part of $reset_v(x, e_v)$ in $H_v$. In our example in Sect. 4, $\gamma_s^{-1}(x)$ would represent x in
the absolute coordinate system assuming it was represented in the coordinate system
defined by segment s. Then $\gamma_{s'}(\gamma_s^{-1}(x))$ would represent $\gamma_s^{-1}(x)$ in the new coordinate
system defined by segment s’. The $guard_v(e_v)$ would be the union of rotated hyperrectangles
centered at the origin that result from translating and rotating the guards of the
edges represented by $e_v$. The initial set $\Theta$ of H is mapped to $\Theta_v = \gamma_{s_{init}}(\Theta)$, the initial
set of $H_v$. A formal definition of symmetry abstractions can be found in [1] (or [14]).
The unsafe map O is mapped to $O_v$, where $\forall s_v \in S_v$, $O_v(s_v) = \bigcup_{s \in S, \rho_s(s) = s_v} \gamma_s(O(s))$.
That means that the obstacles near any segment $s \in S$ in the environment will be mapped
to be near its representative segment $\rho_s(s)$ in $H_v$.

A forward simulation relation between H and $H_v$ can show that if $H_v$ is safe with
respect to $O_v$, then H is safe with respect to O. More formally, if $\forall s_v \in S_v$, $Reach_{H_v}(s_v) \cap O_v(s_v) = \emptyset$,
then $\forall s \in S$, $Reach_H(s) \cap O(s) = \emptyset$ [14].
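
The mode aggregation described in this section can be illustrated with a small sketch for the translation-and-rotation symmetry of Sect. 4. The function names mode_vir and abstract_modes, the rounding of lengths to merge numerically equal segments, and the example plan are assumptions for illustration only; they are not the tool’s modeVir implementation.

```python
import math
from collections import defaultdict


def mode_vir(seg, ndigits=6):
    """Map a 2D segment to its representative: same length, on the negative x-axis,
    ending at the origin. Rounding (an assumption) merges numerically equal lengths."""
    (ax, ay), (bx, by) = seg
    length = round(math.hypot(bx - ax, by - ay), ndigits)
    return ((-length, 0.0), (0.0, 0.0))


def abstract_modes(segments, tbound):
    """Return rv (segment index -> abstract mode) and tbound_v (max over merged segments)."""
    rv = {}
    members = defaultdict(list)
    for i, seg in enumerate(segments):
        sv = mode_vir(seg)
        rv[i] = sv
        members[sv].append(i)
    tbound_v = {sv: max(tbound[i] for i in idx) for sv, idx in members.items()}
    return rv, tbound_v


# Example plan with three segments, two of which have the same length.
plan = [((0.0, 0.0), (3.0, 0.0)), ((3.0, 0.0), (3.0, 3.0)), ((3.0, 3.0), (8.0, 3.0))]
rv, tbound_v = abstract_modes(plan, [4.0, 6.0, 9.0])
```

In this example the two segments of length 3 collapse into one abstract mode whose time bound is the larger of their two bounds, so the abstract automaton has two modes instead of three.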


6 SceneChecker Algorithm Overview 


A sketch of the core abstraction-refinement algorithm is shown in Algorithm 1. It constructs
a symmetry abstraction $H_v$ of the concrete automaton H resulting from the
Hybrid constructor. SceneChecker attempts to verify the safety of $H_v$ using traditional
reachability analysis. SceneChecker uses a cache to store per-mode initial sets from
which reachsets have been computed and thus avoids repeating computations. An example
run is shown in Fig. 1.


Fig. 1. A simple scenario with a car following a plan with six segments is shown in B. Set of
initial positions (green square), unsafe set (grey), and the segments (black lines). The automaton
(A) has one mode per segment. Translation and rotation symmetries are used to abstract A to the
automaton C. The abstraction translates and rotates each segment of the original scenario to a
segment that is aligned with the x-axis and ends at the origin, resulting in the segments (i.e. modes) $s_v^0$
and $s_v^1$. The unsafe set is transformed accordingly for each mode as shown in D. SceneChecker
computes the reachset of C, which turns out to be unsafe; to illustrate the process, this abstract
reachset transformed to the original scenario is shown in E. The colors refer to different abstract
modes. The algorithm refines C to F by adding $s_v^2$ (same segment as $s_v^1$ but different guard). The
reachset of F is safe and the algorithm terminates (H). (The colored figure is available in the
online version of this paper)


The core algorithm verify (Algorithm 2) is called iteratively. If verify returns
(safe, $\bot$) or (unknown, $\bot$), SceneChecker returns the same result. If verify instead
results in (refine, $s_v^*$), splitMode (check the extended version of this paper [1] for the
formal definition) is called to refine $H_v$ by splitting $s_v^*$ into two modes $s_v^1$ and $s_v^2$. Each of
Algorithm 1. SceneChecker($\Phi = \{(\gamma_s, \rho_s)\}_{s \in S}$, H, O)
1: $H_v, O_v \leftarrow$ abstract(H, O, $\Phi$)
2: $\forall s \in S$, rv[s] $\leftarrow \rho_s(s)$
3: while True do
4:     cache $\leftarrow \{s_v \mapsto \emptyset \mid s_v \in S_v\}$
5:     result, $s_v^* \leftarrow$ verify(rv[$s_{init}$], $\Theta_v$, cache, rv, $H_v$, $O_v$)
6:     if result = safe or unknown then return: result
7:     else rv, $H_v$, $O_v \leftarrow$ splitMode($s_v^*$, rv, $H_v$, $O_v$, H, O)


the two modes would represent part of the set of the segments of S that were originally
mapped to $s_v^*$ in rv. Then the edges, guards, resets, and the unsafe sets related to $s_v^*$ are
split according to their definitions.

The function verify executes a depth first search (DFS) over the mode graph of $H_v$.
For any mode $s_v$ being visited, computeReachset computes $R_v$, an over-approximation
of the agent’s reachset starting from initset following segment $s_v$ for time $tbound_v(s_v)$.
If $R_v \cap O_v(s_v) = \emptyset$, verify recursively calls $s_v$’s children, continuing the DFS in line 6.
Before calling each child, its initial set is computed and the part for which a reachset
has already been computed and stored in cache is subtracted. If all calls return safe,
then initset is added to the other initial sets in cache[$s_v$] (line 12) and verify returns safe.
Most importantly, if verify returns (refine, $s_v^*$) for any of $s_v$’s children, it directly returns
(refine, $s_v^*$) for $s_v$ as well (line 7). If any child returns unknown or $R_v$ intersects $O_v(s_v)$,
verify will need to split $s_v$. In that case, it checks whether $rv^{-1}[s_v]$ is not a singleton set and thus
amenable to splitting (line 10). If $s_v$ can be split, verify returns (refine, $s_v$). Otherwise,
verify returns (unknown, $\bot$), implicitly asking one of $s_v$’s ancestors to be split instead.


Correctness. SceneChecker ensures that all the refined automata $H_v$ are abstractions
of the original hybrid automaton H (a proof is given in the extended version of this
paper [1]). For any mode with a reachset intersecting the unsafe set, SceneChecker
keeps refining that mode and its ancestors until safety can be proven or $H_v$ becomes H.


Theorem 1 (Soundness). If SceneChecker returns safe, then H is safe.


If verify is provided with the concrete automaton H and unsafe set O, it will be the tradi- 
tional safety verification algorithm having no over-approximation error due to abstrac- 
tion. If such a call to verify returns safe, then SceneChecker is guaranteed to return safe. 
That means that the refinement ensures that the over-approximation error of the reachset 
caused by the abstraction is reduced to not alter the verification result. 


Counter-examples. SceneChecker currently does not find counter-examples to
show that the scenario is unsafe. There are several sources of over-approximation 
errors, namely, computeReachset and guard intersections. Even after all the over- 
approximation errors from symmetry abstractions are eliminated, as refinement does, 
it still cannot infer unsafe executions or counter-examples because of the other errors. 
We plan to address this in the future by combining the current algorithm with systematic 
simulations. 
Algorithm 2. verify($s_v$, initset, cache, rv, $H_v$, $O_v$)
1: $R_v \leftarrow$ computeReachset(initset, $s_v$)
2: if $R_v \cap O_v(s_v) = \emptyset$ then
3:     for $s_v' \in$ children($s_v$) do
4:         initset’ $\leftarrow reset_v(guard_v((s_v, s_v')) \cap R_v) \setminus$ cache[$s_v'$]
5:         if initset’ $\neq \emptyset$ then
6:             result, $s_v^* \leftarrow$ verify($s_v'$, initset’, cache, rv, $H_v$, $O_v$)
7:             if result = refine then return: refine, $s_v^*$
8:             else if result = unknown then break
9: if $R_v \cap O_v(s_v) \neq \emptyset$ or result is unknown then
10:     if $|rv^{-1}[s_v]| > 1$ then return: refine, $s_v$
11:     else return: unknown, $\bot$
12: cache[$s_v$] $\leftarrow$ cache[$s_v$] $\cup$ initset
13: return: safe, $\bot$
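
The control flow of Algorithm 2 can be sketched in Python under strong simplifying assumptions: sets of states are finite Python sets instead of polytopes, the abstract automaton is a plain dictionary named aut with hypothetical keys (children, guard, unsafe, reach, reset), and result is initialized to safe so that it is defined when no child is visited. This is an illustration of the recursion, not the tool’s implementation.

```python
def verify(sv, initset, cache, rv, aut):
    """Return ('safe', None), ('unknown', None), or ('refine', mode_to_split)."""
    reach = aut["reach"](initset, sv)                 # line 1: over-approximate reachset
    hits_unsafe = bool(reach & aut["unsafe"][sv])
    result = "safe"                                   # assumption: default when no child runs
    if not hits_unsafe:                               # line 2
        for child in aut["children"][sv]:             # line 3
            edge = (sv, child)
            # line 4: next initial set, minus what the cache already covers
            init_child = aut["reset"](aut["guard"][edge] & reach, edge) - cache[child]
            if init_child:                            # line 5
                result, s_star = verify(child, init_child, cache, rv, aut)
                if result == "refine":                # line 7: propagate upward
                    return "refine", s_star
                if result == "unknown":               # line 8
                    break
    if hits_unsafe or result == "unknown":            # line 9
        if sum(1 for mode in rv.values() if mode == sv) > 1:   # line 10
            return "refine", sv
        return "unknown", None                        # line 11
    cache[sv] |= initset                              # line 12
    return "safe", None                               # line 13


def make_automaton(unsafe_b):
    """Toy abstract automaton with modes 'a' -> 'b' over integer states."""
    return {
        "children": {"a": ["b"], "b": []},
        "guard": {("a", "b"): {2, 3}},
        "unsafe": {"a": set(), "b": set(unsafe_b)},
        "reach": lambda init, mode: {x + d for x in init for d in (0, 1, 2)},
        "reset": lambda states, edge: set(states),
    }


def run(unsafe_b, rv):
    cache = {"a": set(), "b": set()}
    return verify("a", {0}, cache, rv, make_automaton(unsafe_b)), cache
```

With no obstacle in mode b the search returns safe and fills the cache. With an obstacle, the outcome depends on rv: if two concrete segments map to b, the call asks for b to be refined; if only one does, the result is unknown.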


7 Experimental Evaluation 


Agents and Controllers. In our experiments, we consider two types of nonlinear agent 
models: a standard 3-dimensional car (C) with bicycle dynamics and 2 inputs, and a 
6-dimensional quadrotor (Q) with 3 inputs. For each of these agents, we developed a 
PD controller and a NN controller for tracking segments. The NN controller for the 
quadrotor is from Verisig’s paper [9] but modified to be rotation symmetric (check the 
extended version of this paper [1] for more details). Similarly, the NN controller for the 
car is made rotation symmetric. Both NN controllers are translation symmetric as they 
take as input the difference between the agent’s state and the segment being followed. 
The PD controllers are translation and rotation symmetric by design. 


Symmetries. We experimented with two different collections of symmetry maps $\Phi$: 1)
translation symmetry (T), where for any segment s in G, $\gamma_s$ maps the states so that the
coordinate system is translated by a vector that makes its origin at the end waypoint of
s, and 2) rotation and translation symmetry (TR), where instead of just translating the
origin, $\Phi$ rotates the xy-plane so that s is aligned with the x-axis, which we described
in Sect. 4. For each agent and one of its controllers, we manually verified that condition
(1) is satisfied for each of the two $\Phi$s using the sufficient condition for ODEs in Sect. 4.


Scenarios. We created four scenarios with 2D workspaces (S1-4) and one scenario with 
a 3D workspace (S5) with corresponding plans. We generated the plans using an RRT 
planner [31] after specifying a number of goal sets that should be reached. We modified 
S4 to have more obstacles but still have the same plan and named the new version S4.b 
and the original one S4.a. When the quadrotor was considered, the waypoints of the 2D
scenarios (S1-4) were converted to 3D representation by setting the altitude for each
waypoint to 0. Scenario S5 is the same as S2 but S5’s waypoints have varying altitudes.
The scenarios have different complexities ranging from few segments and obstacles to
hundreds of them. All scenarios are safe when traversed by either of the two agents.

We verify these scenarios using SceneChecker and CacheReach, each with two 
instances, one with DryVR and the other with Flow*, implementing computeReachset. 
We also use DryVR and Flow* as independent tools to verify the same scenarios.
The results of experiments with tools that involve DryVR (i.e., SceneChecker+DryVR, 
CacheReach+DryVR, and DryVR) are stochastic and change between runs. The reason 
is that each time DryVR is called, it randomly samples traces of the system from which 
it computes the requested reachset. We fix the random seed for repeatable results in this 
section. We show close averaging-based results on SceneChecker’s website. 

SceneChecker is able to verify all scenarios with PD controllers. The results are
shown in Table 1⁴ and plotted for C-S1 using SceneChecker+Flow* in Fig. 1.


Observation 1: SceneChecker offers fast scenario verification and boosts existing
reachability tools. Comparing the two total time (Tt) columns for the two instances
of SceneChecker with the corresponding columns for Flow* and DryVR, it becomes
clear that symmetry abstractions can boost the verification performance of reachability
engines. For example, in C-S4.a, SceneChecker+DR was around 20x faster than
DryVR. In C-S3, SceneChecker with Flow* was around 16x faster than Flow*. In scenario
Q-S5, SceneChecker timed out at least in part because a computeReachset call
to Flow* timed out. Even when many refinements are required, causing several
repetitions of the verification process in Algorithm 1, SceneChecker is still faster
than DryVR and Flow* (C-S4.b). All three tools returned safe for all scenarios on
which they completed execution.


Observation 2: SceneChecker is faster and more accurate than CacheReach. Since
CacheReach only handles single-path plans, we only verify the longest path in the
plans of the scenarios in its experiments. CacheReach’s instance with Flow* resulted
in unsafe reachsets in the C-S1 and C-S4.b scenarios, likely because of the caching
over-approximation error. In all scenarios where CacheReach completed verification besides
C-S4.b, it has more Rc and longer Tt (more than 30x in C-S2) while verifying simpler
plans than SceneChecker using the same reachability subroutine. In all Q scenarios,
both instances of CacheReach, with Flow* and DryVR, timed out.


Observation 3: More symmetric dynamics result in faster verification time.
SceneChecker usually runs slower in 3D scenarios compared to 2D ones (Q-S2 vs. Q-S5),
in part because there is no rotational symmetry in the z-dimension to exploit. That
leads to larger abstract automata. Therefore, many more calls to computeReachset are
required.

We only used SceneChecker’s instance with DryVR for agents with NN controllers⁵.
We tried different $\Phi$s. The results are shown in Table 2. When not using
abstraction-refinement, SceneChecker took 10.5, 130.95, and 74.15 min for the QNN-S2,
QNN-S3, and QNN-S4 scenarios, while DryVR took 5.22, 52.56, and 61.31 min for
the same scenarios, respectively. Comparing these results with those in Table 2 shows

4 Figures presenting the reachsets of the concrete and abstract automata for different scenarios
can be found in the extended version of this paper [1] as well as the machine specifications.

5 Check the extended version [1] for a discussion about our attempts for using other verification 
tools for NN-controlled systems as reachability subroutines. 
that the speedup in verification time of SceneChecker is caused by the abstraction-refinement
algorithm, achieving more than 13x in certain scenarios (QNN-S4 using
$\Phi$ = T). SceneChecker+DR was more than 10x faster than DryVR in the same scenario.


Table 2. Comparison between $\Phi$s. In addition to the statistics of Table 1, this table reports the
number of modes and edges in the initial and final (after refinement) abstractions ($|S_v|^i$, $|E_v|^i$,
$|S_v|^f$, and $|E_v|^f$, respectively)


Sc       NRef  $\Phi$  |S|  $|S_v|^i$  $|E_v|^i$  $|S_v|^f$  $|E_v|^f$  Rc  Rt    Tt

CNN-S2   6     TR   140  1        1        7        17       19  1.51  3.05
CNN-S4   9     TR   520  1        1        10       28       47  3.77  11.25
QNN-S2   3     TR   140  1        1        4        9        9   0.61  3.55
QNN-S3   5     TR   458  1        1        6        16       15  1.51  12.7
QNN-S4   4     TR   520  1        1        5        13       11  1.11  7.43
QNN-S2   0     T    140  7        19       7        19       8   0.53  1.38
QNN-S3   4     T    458  7        30       11       58       29  2.92  16.88
QNN-S4   0     T    520  7        30       7        30       13  1.32  5.34


Observation 4: Choice of $\Phi$ is a trade-off between over-approximation error and number
of refinements. The choice of $\Phi$ affects the number of refinements performed and
the total running times (e.g. QNN-S2, QNN-S3, and QNN-S4). Using TR leads to a
more succinct $H_v$ but larger over-approximation error, causing more mode splits. On the
other hand, using T leads to a larger $H_v$ but less over-approximation error and thus fewer
refinements. This trade-off can be seen in Table 2. For example, QNN-S4 with $\Phi$ = T
resulted in zero mode splits leading to $|S_v|^i = |S_v|^f = 7$, while $\Phi$ = TR resulted in 4
mode splits, starting with $|S_v|^i = 1$ mode and ending with $|S_v|^f = 5$, and longer verification
time because of refinements. On the other hand, in QNN-S3, $\Phi$ = TR resulted in
NRef = 5, $|S_v|^f = 6$, and Tt = 12.7 min while $\Phi$ = T resulted in NRef = 4, $|S_v|^f = 11$, and
Tt = 16.88 min.


Observation 5: Complicated dynamics require more verification time. Different vehicle
dynamics affect the number of refinements performed and consequently the verification
time (e.g. QNN-S2, QNN-S4, CNN-S2, and CNN-S4). The car appears to be less stable
than the quadrotor, leading to longer verification time for the same scenarios. This can
also be seen by comparing the results of Tables 1 and 2. The PD controllers lead to more
stable dynamics than the NN controllers, requiring less total computation time for both
agents. More stable dynamics lead to tighter reachsets and fewer refinements.


8 Limitations and Discussions 


SceneChecker allows the choice of modes to be changed from segments to waypoints 
or sequences of segments as well. The waypoint-defined modes eliminate the need for 
segments of G to have few unique lengths, but only allow $\Phi$ = T. SceneChecker splits
only one mode per refinement and then repeats the computation from scratch. It has 
to refine many times in unsafe scenarios until reaching the result unknown. We plan 
to investigate other strategies for eliminating spurious counter-examples and returning 
valid ones in unsafe cases. In the future, it will be important to address other sources of 
uncertainty in scene verification such as moving obstacles, interactive agents, and other 
types of symmetries such as permutation and time scaling. Finally, it will be useful to 
connect a translator to generate scene files from common road simulation frameworks 
such as CARLA [32], CommonRoad [33], and Scenic [34].
Abstract. Hybrid system falsification is an important quality assurance
method for cyber-physical systems, with the advantage of better scalability and
feasibility in practice than exhaustive verification. Falsification, given a
desired temporal specification, tries to find an input that violates it instead
of a proof guarantee. The state-of-the-art falsification approaches often
employ stochastic hill-climbing optimization that minimizes the degree of 
satisfaction of the temporal specification, given by its quantitative robust 
semantics. However, it has been shown that the performance of falsifica- 
tion could be severely affected by the so-called scale problem, related to 
the different scales of the signals used in the specification (e.g., rpm and 
speed): in the robustness computation, the contribution of a signal could 
be masked by another one. In this paper, we propose a novel approach 
to tackle this problem. We first introduce a new robustness definition, 
called QB-Robustness, which combines classical Boolean satisfaction and 
quantitative robustness. We prove that QB-Robustness can be used to 
judge the satisfaction of the specification and avoid the scale problem 
in its computation. QB-Robustness is exploited by a falsification app- 
roach based on Monte Carlo Tree Search over the structure of the formal 
specification. First, tree traversal identifies the sub-formulas for which it 
is needed to compute the quantitative robustness. Then, on the leaves, 
numerical hill-climbing optimization is performed, aiming to falsify such 
sub-formulas. Our in-depth evaluation on multiple benchmarks demon- 
strates that our approach achieves better falsification results than the 
state-of-the-art falsification approaches guided by the classical quantita- 
tive robustness, and it is largely not affected by the scale problem. 
1 Introduction


Cyber-Physical Systems (CPS) are hybrid systems that combine physical systems 
(with continuous dynamics) and digital controllers (that are inherently discrete). 
Being often safety-critical, their quality assurance is of great importance and 
widely investigated by both academia and industry. The continuous dynamics 
of hybrid systems leads to infinite search spaces, making their verification often 
extremely difficult. 

Falsification has been proposed as a more practically feasible approach that
tackles the dual problem of verification: instead of exhaustively proving a property,
falsification intends to uncover the existence of its violation with counterexamples.
Formally, the problem is defined as follows. Given a model M taking an
input signal u and outputting a signal M(u), and a specification $\varphi$ (a temporal
formula), the falsification problem consists in finding a falsifying input, i.e., an
input signal u such that the corresponding output M(u) violates $\varphi$.

The most pursued and successful approach to the falsification problem consists
in turning it into an optimization problem; we call it optimization-based
falsification. This is possible thanks to the quantitative robust semantics of temporal
formulas [14,19]. Robust semantics extends the classical Boolean satisfaction
relation $w \models \varphi$ in the following way: it assigns a value $\llbracket w, \varphi \rrbracket \in \mathbb{R} \cup \{\infty, -\infty\}$
(i.e., robustness) that tells not only whether $\varphi$ is satisfied or violated (by the
sign), but also how robustly the formula is satisfied or violated.

Optimization-based falsification approaches adopt hill-climbing stochastic
optimization strategies to generate inputs that decrease robustness; they terminate
when they find an input with negative robustness, i.e., a falsifying input
that triggers the violation of the specification $\varphi$. Different optimization-based
falsification algorithms have been proposed (see [26] for a survey), and mature
tools (e.g., Breach [13] and S-TaLiRo [4]) have also been developed.
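
The loop described above can be illustrated with a deliberately small sketch. It is not the algorithm of any of the cited tools: the model (speed(t) = u * t with a single constant input u), the specification (always speed < 130 over [0, 30]), the function names, and the deterministic step-halving search that replaces a stochastic optimizer are all assumptions chosen so that the behavior is easy to follow.

```python
def robustness(u, horizon=30.0, steps=300):
    """Robustness of 'always_[0,30] (speed < 130)' for the toy model speed(t) = u * t."""
    return min(130.0 - u * (horizon * k / steps) for k in range(steps + 1))


def falsify(u0, lo=0.0, hi=5.0, step=0.5, budget=100):
    """Greedy descent on robustness; returns (input, robustness) once it is negative."""
    u, rob = u0, robustness(u0)
    for _ in range(budget):
        if rob < 0:
            return u, rob                      # falsifying input found
        candidates = [min(hi, u + step), max(lo, u - step)]
        best = min(candidates, key=robustness)
        if robustness(best) >= rob:
            step /= 2                          # no improvement: refine the step size
            continue
        u, rob = best, robustness(best)
    return (u, rob) if rob < 0 else None       # budget exhausted without falsification


found = falsify(1.0)             # input range [0, 5] admits a violation
not_found = falsify(1.0, hi=4.0) # input range [0, 4] keeps speed below 130
```

Starting from u = 1, each step lowers the robustness by 15 until u = 4.5 gives a negative value; when the input range is capped at 4, the search stalls at a positive robustness and reports no falsifying input.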

The scale problem is a recognized issue in optimization-based falsification [21,
40], which could arise when multiple signals with different scales are present in
the specification. Namely, it is due to the computation of robust semantics of
Boolean connectives, i.e., the way in which the robustness values of different
sub-formulas are compared and aggregated: such computation is problematic in
the presence of signals that take values having different orders of magnitude.


Example 1. As a very simple example, let us consider the formula
$\varphi \equiv \Box_{[0,30]}(\varphi_1 \wedge \varphi_2)$, with $\varphi_1 \equiv gear < 6$ and $\varphi_2 \equiv speed < 130$. It is apparent
that $\varphi_1$ is always satisfied (in any car model with 5 gears), and it has been
added in the specification as a redundant check.¹ According to robust semantics,
the Boolean connective $\wedge$ is interpreted by minimum $\sqcap$, and the “always”
operator $\Box_{[0,30]}$ is interpreted by infimum $\bigsqcap$; the robustness of an atomic
formula $f(x) < c$ is given by the margin $c - f(x)$. Therefore, the robustness
of $\varphi$ under the signal (gear, speed), where gear, speed: $[0,30] \to \mathbb{R}$, is

1 Note that we built such a trivial example just to make the scale problem very easy 
to understand. However, in general, the scale problem frequently occurs on much 
less trivial specifications, as we will see in the experiments. 
$\llbracket (gear, speed), \varphi \rrbracket = \bigsqcap_{t \in [0,30]} \big( (6 - gear(t)) \sqcap (130 - speed(t)) \big)$. Note that the
robustness of $\varphi_1$ is always in the order of units, while the robustness of $\varphi_2$ is,
in general, in the order of tens. It is not difficult to see that, if both $\varphi_1$ and
$\varphi_2$ are satisfied, the robustness of $\varphi$ will only depend on $\varphi_1$ (because of the
minimum in the robust semantics of the logical connective). In this case, we say
that $\varphi_1$ masks $\varphi_2$. In such a case, a falsification approach relying on robustness
will be misled during the search. Note that, in this particular case, the only way
to falsify $\varphi$ is to falsify $\varphi_2$, because $\varphi_1$ is always satisfied; therefore, falsifying
this relatively simple formula could be extremely difficult for state-of-the-art
optimization-based falsification approaches (as we will show and have confirmed
in the experiments).
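
The masking effect of Example 1 can be reproduced numerically. The sketch below evaluates the robustness of the formula on signals sampled at integer time points; the particular gear schedule and the linear speed profiles are assumptions invented for illustration, and sampling replaces the infimum over continuous time.

```python
def robustness(gear, speed, horizon=30):
    """Robustness of always_[0,30] (gear < 6 and speed < 130) on integer-sampled signals:
    minimum over time of the minimum of the two margins."""
    return min(min(6 - gear(t), 130 - speed(t)) for t in range(horizon + 1))


def gear(t):
    """Assumed gear schedule: shift up every 6 time units, capped at gear 5."""
    return min(5, 1 + t // 6)


# Two safe speed profiles with very different margins to the 130 threshold.
rob_slow = robustness(gear, lambda t: 3 * t)    # top speed 90, margin 40
rob_fast = robustness(gear, lambda t: 4 * t)    # top speed 120, margin 10

# A violating profile: only now does the speed margin become visible.
rob_violation = robustness(gear, lambda t: 4.5 * t)   # top speed 135
```

Both safe profiles yield the same robustness of 1, which comes from the gear margin alone, so an optimizer sees no gradient while the speed climbs from 90 to 120. The value only changes once the speed margin drops below the gear margin.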


In this paper, we propose a novel approach to tackle the scale problem in 
optimization-based falsification. Our intuition and insights are that we should 
try to avoid the comparison of robustness values of different sub-formulas, so 
that one sub-formula does not mask the contribution of another one. 

To achieve this, we first propose a new way of computing the satisfaction
of a formula that combines quantitative robust semantics and Boolean semantics.
We name the new semantics QB-Robustness. QB-Robustness, for each
type of formula $\varphi$, requires selecting a sub-formula $\varphi_k$ among its sub-formulas
$\{\varphi_1, \ldots, \varphi_K\}$. For $\varphi_k$, the quantitative robust semantics is computed, while for
the other sub-formulas the Boolean semantics is computed. Therefore, the computation
of QB-Robustness requires identifying a path X along the parse tree of
the formula $\varphi$, where visited sub-formulas are those for which the quantitative
robustness is computed. We prove that QB-Robustness, independently of the
selected X, is equivalent (in terms of sign and satisfaction) to the quantitative
robust semantics (and also to the Boolean one).

In general, QB-Robustness is a useful tool for avoiding the scale problem of
falsification. By definition, the quantitative robustness of different sub-formulas
is never compared, thus removing the main cause of the scale problem. It would
then make sense to use it for guiding the optimization-based falsification process.
However, QB-Robustness requires choosing a particular sequence X of sub-formulas
for which to compute the quantitative robustness. It is relatively easy to
show that some of them provide better guidance than others to the falsification
search. Considering the previous example, if X contains $\varphi_1$, we can encounter the
problem that the quantitative robustness of $\varphi_1$ would not provide any guidance
(i.e., no big variations in the robustness values would be observed). On the other
hand, if X contains $\varphi_2$, the quantitative robustness would have larger variations,
providing more effective guidance to the search.

Then, the key problem is how to select the best X, one that enables the hill-climbing
optimization used in falsification to be more effective. In general,
although it is often difficult to know the best X in advance, it is still possible
to learn it by observing sampling results using different X. Based on this
intuition, we propose a novel falsification approach that identifies the sequences
X that are more likely to be effective, and uses them in the new falsification
trials. Our approach could be seen as an instantiation of the classical Monte
Carlo Tree Search (MCTS) method [8,28], which is able to efficiently tackle the
exploration-exploitation tradeoff. In our context, exploration consists in incrementally
constructing the tree that represents all the possible sequences, and
exploitation consists in selecting the best X and running optimization-based
falsification in which QB-Robustness with X is used.

Overall, the major contributions of this paper are summarized as follows:


— We propose a novel semantics (QB-Robustness) for STL formulas that com- 
bines quantitative robustness and Boolean satisfaction. We prove that QB- 
Robustness can be used to show the satisfiability of STL formulas; 

— We define a falsification approach based on MCTS that exploits QB- 
Robustness to address the scale problem; 

— We implement the approach in the tool ForeSee, based on which, we per- 
formed in-depth evaluation, demonstrating the effectiveness and advantage 
of our approach compared with the state of the art. 


Paper Structure. In Sect. 2, we introduce the preliminaries of optimization-based
falsification. In Sect. 3, we introduce the novel STL semantics QB-Robustness,
and, in Sect. 4, we describe the MCTS-based falsification approach
that uses QB-Robustness. In Sect. 5, we describe the experiments and evaluation
results. Finally, we discuss the work most relevant to ours in Sect. 6, and conclude
the paper in Sect. 7.


2 Preliminaries 


In this section, we briefly review the falsification framework based on robust 
semantics of temporal logic [14]. 

Let $T \in \mathbb{R}_{+}$ be a positive real. An $M$-dimensional signal with a time horizon
$T$ is a function $w\colon [0,T] \to \mathbb{R}^{M}$. We treat the system model as a black box, i.e.,
its behaviors are only observed from inputs and their corresponding outputs.
Formally, a system model, with $M$-dimensional input and $N$-dimensional output,
is a function $\mathcal{M}$ that takes an input signal $u\colon [0,T] \to \mathbb{R}^{M}$ and returns a signal
$\mathcal{M}(u)\colon [0,T] \to \mathbb{R}^{N}$. Here the common time horizon $T \in \mathbb{R}_{+}$ is arbitrary.


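Definition 1 (STL Syntax). We fix a set Var of variables. In Signal Temporal
Logic (STL), atomic propositions and formulas are defined as follows, respectively:
\[
\alpha ::\equiv f(x_1,\dots,x_N) > 0, \qquad
\varphi ::\equiv \alpha \mid \bot \mid \neg\varphi \mid \varphi\wedge\varphi \mid \varphi\vee\varphi
  \mid \Box_I\varphi \mid \Diamond_I\varphi \mid \varphi\,\mathcal{U}_I\,\varphi .
\]
Here $f$ is an $N$-ary function $f\colon \mathbb{R}^{N} \to \mathbb{R}$, $x_1,\dots,x_N \in \mathrm{Var}$, and $I$ is a
closed non-singular interval in $\mathbb{R}_{\ge 0}$, i.e., $I = [a,b]$ or $[a,\infty)$, where $a,b \in \mathbb{R}$ and
$a < b$. $\Box$, $\Diamond$ and $\mathcal{U}$ are temporal operators, which are usually known as always,
eventually and until, respectively. The always operator $\Box_I$ and the eventually operator
$\Diamond_I$ can also be considered as special cases of the until operator $\mathcal{U}_I$, where
$\Diamond_I\varphi \equiv \top\,\mathcal{U}_I\,\varphi$ and $\Box_I\varphi \equiv \neg\Diamond_I\neg\varphi$. Other common connectives such as $\to$, $\top$ are
introduced as syntactic sugar: $\top \equiv \neg\bot$, $\varphi_1 \to \varphi_2 \equiv \neg\varphi_1 \vee \varphi_2$.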


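Definition 2 (Quantitative Robust Semantics). Let $w\colon [0,T] \to \mathbb{R}^{N}$ be
an $N$-dimensional signal, and $t \in [0,T)$. The $t$-shift $w^{t}$ of $w$ is the signal
$w^{t}\colon [0,T-t] \to \mathbb{R}^{N}$ defined by $w^{t}(t') := w(t+t')$. Let $\varphi$ be an STL formula.
We define the robustness $[\![w,\varphi]\!] \in \mathbb{R} \cup \{\infty,-\infty\}$ as follows, by induction
on the construction of formulas. $\bigsqcap$ and $\bigsqcup$ denote infimums and supremums of
real numbers, respectively. $\sqcap$, the binary version of $\bigsqcap$, denotes minimum.

\[
\begin{aligned}
[\![w, f(x_1,\dots,x_N) > 0]\!] &:= f\big(w(0)(x_1),\dots,w(0)(x_N)\big) \\
[\![w, \bot]\!] &:= -\infty &
[\![w, \neg\varphi]\!] &:= -[\![w,\varphi]\!] \\
[\![w, \textstyle\bigwedge_i \varphi_i]\!] &:= \textstyle\bigsqcap_i [\![w,\varphi_i]\!] &
[\![w, \textstyle\bigvee_i \varphi_i]\!] &:= \textstyle\bigsqcup_i [\![w,\varphi_i]\!] \\
[\![w, \Box_I\varphi]\!] &:= \textstyle\bigsqcap_{t \in I\cap[0,T]} [\![w^{t},\varphi]\!] &
[\![w, \Diamond_I\varphi]\!] &:= \textstyle\bigsqcup_{t \in I\cap[0,T]} [\![w^{t},\varphi]\!] \\
[\![w, \varphi_1\,\mathcal{U}_I\,\varphi_2]\!] &:= \textstyle\bigsqcup_{t \in I\cap[0,T]}
  \Big( [\![w^{t},\varphi_2]\!] \sqcap \textstyle\bigsqcap_{t' \in [0,t)} [\![w^{t'},\varphi_1]\!] \Big)
\end{aligned}
\]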


The original STL semantics is Boolean, given by a binary relation $\models$ between
signals and formulas. The robust semantics refines the Boolean one in the following
sense: $[\![w,\varphi]\!] > 0$ implies $w \models \varphi$, and $[\![w,\varphi]\!] < 0$ implies $w \not\models \varphi$, see [19,
Prop. 16].
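As an illustration of Definition 2, the following minimal sketch evaluates the robustness on a discretely sampled signal. It is not the implementation used by any falsification tool: the tuple encoding of formulas, the function name `rob`, and the restriction to unit-step sampling with intervals given as sample offsets are all assumptions made for this example, and the until operator is omitted.

```python
import math

def rob(w, phi, t=0):
    """Robustness of the suffix of the sampled signal w that starts at index t.

    w is a list of samples (dicts from variable names to values); phi is a
    nested tuple. Both encodings are hypothetical and used only in this sketch.
    """
    op = phi[0]
    if op == "atom":                      # f(x1..xN) > 0: value of f at the current sample
        return phi[1](w[t])
    if op == "bot":
        return -math.inf
    if op == "not":
        return -rob(w, phi[1], t)
    if op == "and":                       # infimum over the operands
        return min(rob(w, p, t) for p in phi[1:])
    if op == "or":                        # supremum over the operands
        return max(rob(w, p, t) for p in phi[1:])
    if op in ("always", "eventually"):
        (a, b), sub = phi[1], phi[2]
        # I ∩ [0, T]: clip the window to the samples that actually exist
        vals = [rob(w, sub, s) for s in range(t + a, min(t + b, len(w) - 1) + 1)]
        if op == "always":
            return min(vals, default=math.inf)
        return max(vals, default=-math.inf)
    raise ValueError(f"unsupported operator: {op}")

# speed < 130 is encoded as the atom 130 - speed > 0, and similarly for gear < 5.
speed_ok = ("atom", lambda s: 130 - s["speed"])
gear_ok = ("atom", lambda s: 5 - s["gear"])
phi = ("always", (0, 2), ("and", speed_ok, gear_ok))

w = [{"speed": 100, "gear": 3}, {"speed": 120, "gear": 4}, {"speed": 135, "gear": 4}]
print(rob(w, phi))  # -5: the last sample violates speed < 130
```

A negative value certifies a violation of the formula on the sampled signal, in line with the refinement property stated above.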


2.1 Hill Climbing-Guided Falsification 


The falsification problem has received extensive industrial and academic
attention. Solving it by hill-climbing optimization is an established field,
too: see [2-4, 10, 13-15, 17, 26, 29, 36-39, 42] and the tools
Breach [13] and S-TaLiRo [4]. We formulate the problem and the methodology
for later use in describing our falsification approach.


Definition 3 (Falsifying Input). Let $\mathcal{M}$ be a system model, and $\varphi$ be an STL
formula. A signal $u\colon [0,T] \to \mathbb{R}^{|\mathrm{Var}|}$ is a falsifying input if $[\![\mathcal{M}(u),\varphi]\!] < 0$; the
latter implies $\mathcal{M}(u) \not\models \varphi$.


The use of quantitative robust semantics $[\![\mathcal{M}(u),\varphi]\!] \in \mathbb{R} \cup \{\infty,-\infty\}$ in the above
problem enables the use of hill-climbing optimization.


Definition 4 (Hill Climbing-Guided Falsification). Assume the setting in
Definition 3. The methodology of hill climbing-guided falsification for finding a
falsifying input is presented in Algorithm 1. Here, the function HILL-CLIMB makes
a guess of an input signal $u'$, aiming at minimizing the robustness $[\![\mathcal{M}(u'),\varphi]\!]$.
It does so by learning from the sampling history $H$ that contains the previous
observations of input signals and their corresponding robustness values.


The HILL-CLIMB function can be designed based on various stochastic optimization
algorithms. Typically, in the early phase of the optimization, the proposal
of a new input is based on random sampling; as the sampling
history grows larger, the algorithm applies various metaheuristic-based strategies
to achieve the optimization goal efficiently. Examples of such algorithms
include Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [7] (used in
our experiments), Simulated Annealing, Global Nelder Mead [32], etc.

Algorithm 1. Hill climbing-guided falsification
Require: a system model $\mathcal{M}$, an STL formula $\varphi$, and a time budget
function HILL-CLIMB-FALSIFY($\mathcal{M}$, $\varphi$)
    initialize a placeholder $u$ and $rb \gets \infty$      ▷ the best input signal and robustness
    $H \gets \emptyset$                                      ▷ sampling history of input signals and robustness
    while $rb > 0$ and within the time budget do
        $u' \gets$ HILL-CLIMB($H$)                           ▷ run hill climbing based on sampling history
        $rb' \gets [\![\mathcal{M}(u'),\varphi]\!]$          ▷ compute robustness
        $H \gets H \cup \{(u', rb')\}$                       ▷ update sampling history
        if $rb' < rb$ then
            $rb \gets rb'$, $u \gets u'$                     ▷ update the best input and robustness
    return $u$ if $rb < 0$;
           Failure otherwise, that is, no falsifying input found within the budget
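The loop of Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the optimizer used in the paper: CMA-ES is replaced by a simple elitist Gaussian perturbation, the time budget is replaced by a sample budget, inputs are fixed-length vectors of control points, and the names `hill_climb`, `hill_climb_falsify`, and `toy_robustness` (as well as the warm-up length and step size) are assumptions introduced for this example.

```python
import random

def hill_climb(history, dim, lo, hi, rng, sigma=0.1, warmup=5):
    """Propose an input: random sampling first, then perturb the best sample so far."""
    if len(history) < warmup:
        return [rng.uniform(lo, hi) for _ in range(dim)]
    best_u, _ = min(history, key=lambda h: h[1])      # input with the lowest robustness
    step = sigma * (hi - lo)
    return [min(hi, max(lo, x + rng.gauss(0.0, step))) for x in best_u]

def hill_climb_falsify(robustness, dim, lo, hi, budget, seed=0):
    """Algorithm 1 with a sample budget in place of the time budget."""
    rng = random.Random(seed)
    u, rb = None, float("inf")                        # best input and robustness so far
    history = []                                      # sampling history H
    samples = 0
    while rb > 0 and samples < budget:
        u_new = hill_climb(history, dim, lo, hi, rng)
        rb_new = robustness(u_new)                    # stands for [[M(u'), phi]]
        history.append((u_new, rb_new))
        if rb_new < rb:
            rb, u = rb_new, u_new
        samples += 1
    return (u, rb) if rb < 0 else None                # None plays the role of "Failure"

def toy_robustness(u):
    # Toy stand-in: the "model" outputs its input and phi = always (x < 0.9).
    return min(0.9 - x for x in u)

result = hill_climb_falsify(toy_robustness, dim=3, lo=0.0, hi=1.0, budget=300)
```

In a real setting, `robustness` would simulate the system model on the input signal and evaluate the STL robustness of the resulting output.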


3 QB-Robustness 


The scale problem is a well-known issue that degrades the performance of
falsification. It arises when connective operators (i.e., conjunction
and disjunction) whose operands predicate on different signals appear in the
STL formula under falsification. According to the classic quantitative robust
semantics (see Definition 2), the robustness of those formulas is calculated based
on the comparison (minimum for conjunction, and maximum for disjunction)
between robustness values coming from the different operand sub-formulas. However,
since different signals may differ in magnitude, the comparison may be
biased, such that one signal $w$ may always (or often) mask the contribution of
the others, and, therefore, the final robustness may be dominated by this signal
$w$. Note that, although the scale problem originates at connective operators, it is not
local to the place of their application: it is always propagated to the
robustness of the whole formula. The scale problem has been shown to be a root
cause of the failure of many falsification problems [21,40].
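A small worked instance may help to see the masking effect. The formula below follows the gear/speed pattern used in this paper, while the concrete signal values are assumptions chosen only for illustration.

```latex
\begin{align*}
\varphi &\equiv \Box_I\,(\mathit{gear} < 5 \wedge \mathit{speed} < 130), \\
[\![w^{t}, \mathit{gear} < 5 \wedge \mathit{speed} < 130]\!]
  &= \big(5 - \mathit{gear}(t)\big) \sqcap \big(130 - \mathit{speed}(t)\big), \\
\mathit{gear}(t) = 4,\ \mathit{speed}(t) = 100 &: \quad 1 \sqcap 30 = 1, \\
\mathit{gear}(t) = 4,\ \mathit{speed}(t) = 125 &: \quad 1 \sqcap 5 = 1.
\end{align*}
```

Although the speed moves from 100 to 125, i.e., much closer to violating $\mathit{speed} < 130$, the robustness stays at 1 because the small-magnitude gear term wins the minimum. A hill-climbing optimizer therefore receives no signal that it is making progress on the speed.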

In this work, we propose a novel approach for solving the scale problem in 
falsification. Our approach consists in introducing a new semantics for STL that 
does not suffer from the scale problem. Such new semantics will be used in a 
falsification approach based on Monte Carlo Tree Search. We describe details of 
the new semantics in this section, and the new falsification approach in Sect. 4. 

The new proposed semantics, called QB-Robustness, combines quantitative
robustness and Boolean satisfaction. By construction, it never compares quantitative
robustness values that come from different sub-formulas, thus avoiding the
scale problem. QB-Robustness is defined for all STL formulas, except for
the “until” operator $\varphi_1\,\mathcal{U}_I\,\varphi_2$ when $\varphi_1$ is an arbitrary formula. We still support
it in the form of the “eventually” and “always” operators², i.e., when $\varphi_1 \equiv \top$. Note that this is
not a major limitation, as QB-Robustness still supports the majority of specifications
that are used in industry: indeed, in the experiments, we were able to
handle all the specifications used in falsification competitions [18], which collect
benchmarks from industrial case studies.

² Recall from Definition 1 that the “eventually” and “always” operators are defined
in terms of the “until” operator.

To better explain the computation of QB-Robustness, we introduce some 
definitions. Let us first define the notion of immediate sub-formula for STL. 


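Definition 5 (Immediate Sub-Formulas). Let $\varphi$ be an STL formula (see
Definition 1). We define the set $\mathrm{ISForm}(\varphi)$ of immediate sub-formulas of $\varphi$ as
follows:

\[
\begin{aligned}
\mathrm{ISForm}(\alpha) &:= \emptyset &
\mathrm{ISForm}(\bot) &:= \emptyset &
\mathrm{ISForm}(\neg\varphi) &:= \mathrm{ISForm}(\varphi) \\
\mathrm{ISForm}\Big(\textstyle\bigwedge_{i \in \{1,\dots,K\}} \varphi_i\Big) &:= \{\varphi_1,\dots,\varphi_K\} &
\mathrm{ISForm}\Big(\textstyle\bigvee_{i \in \{1,\dots,K\}} \varphi_i\Big) &:= \{\varphi_1,\dots,\varphi_K\} \\
\mathrm{ISForm}(\Box_I\varphi) &:= \mathrm{ISForm}(\varphi) &
\mathrm{ISForm}(\Diamond_I\varphi) &:= \mathrm{ISForm}(\varphi)
\end{aligned}
\]

Intuitively, the immediate sub-formula set of a connective (conjunction or
disjunction) contains all its operands. For the unary operators (temporal
operators and negation), the immediate sub-formula set is given by the
immediate sub-formula set of the argument.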

The computation of QB-Robustness requires selecting some nested immediate
sub-formulas. To this aim, we introduce the notion of sub-formula sequence.


Definition 6 (Sub-Formula Sequence). Let $\varphi$ be an STL formula. A
sub-formula sequence $X = \sigma_1 \cdot \ldots \cdot \sigma_L$ w.r.t. $\varphi$ is defined as follows:

\[
\sigma_1 \in \mathrm{ISForm}(\varphi), \qquad
\sigma_{l+1} \in \mathrm{ISForm}(\sigma_l) \quad \text{with } l = 1,\dots,L-1,
\]

where $\cdot$ is the concatenation operator in the sequence. We use $X_k$ to denote
the $k$th element of $X$. Moreover, we denote the first element by $X_{\mathrm{head}}$, and the
last element by $X_{\mathrm{rear}}$. We use $X_{\mathrm{tail}}$ to denote $X$ without $X_{\mathrm{head}}$. We identify with $\varepsilon$
the empty sequence; when $\mathrm{ISForm}(\varphi) = \emptyset$, we use $\varepsilon$ as its sub-formula sequence.
We identify with $\mathcal{X}_\varphi$ the set of all the sub-formula sequences rooted in $\varphi$.
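Definitions 5 and 6 can be made concrete with a short sketch. The nested-tuple encoding of formulas and the function names `isform` and `sequences` are hypothetical; the sketch also assumes that a sequence is always extended until a formula with an empty immediate sub-formula set is reached, which is what the later use of sequences appears to require.

```python
def isform(phi):
    """Immediate sub-formulas of phi (Definition 5) for a nested-tuple encoding."""
    op = phi[0]
    if op in ("atom", "bot"):
        return []
    if op == "not":
        return isform(phi[1])
    if op in ("and", "or"):               # connectives: all operands
        return list(phi[1:])
    if op in ("always", "eventually"):    # phi = (op, interval, argument)
        return isform(phi[2])
    raise ValueError(f"unsupported operator: {op}")

def sequences(phi):
    """All sub-formula sequences rooted in phi, each extended down to an atom."""
    children = isform(phi)
    if not children:
        return [[]]                       # only the empty sequence
    return [[c] + rest for c in children for rest in sequences(c)]

# The formula of Fig. 1: always_I (phi1 or phi2), with phi1 = phi11 and phi12.
p11, p12, p2 = ("atom", "phi11"), ("atom", "phi12"), ("atom", "phi2")
p1 = ("and", p11, p12)
phi = ("always", (0, 30), ("or", p1, p2))

all_sequences = sequences(phi)
# Three sequences: phi1·phi11, phi1·phi12, and phi2.
```

Each sequence selects exactly one operand at every Boolean connective on the way down, so the number of sequences equals the number of atomic sub-formulas reachable this way.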


To be specific, in a sub-formula sequence $X$, each element is one of the immediate
sub-formulas of the previous element. This means that, for Boolean connectives, only
one of the operands is selected. Moreover, an atomic sub-formula predicating
over a single signal can only appear as the final element of a sequence. We
exploit these characteristics of $X$ to define QB-Robustness, which combines the
quantitative robustness of the sub-formulas related to a given signal with the
Boolean satisfaction of the other sub-formulas. QB-Robustness, given a sequence
$X$, decides whether to compute the quantitative robust semantics or the Boolean
semantics of a sub-formula, by considering whether the sub-formula belongs to $X$
or not. This implies that, in the case of conjunction and disjunction, we evaluate
the quantitative robustness of the sub-formula in $X$ and the Boolean satisfaction
of the other sub-formulas. Based on such intuition, we define the semantics of
our proposed QB-Robustness in Definition 7, and demonstrate its usefulness in
Theorem 1.

Definition 7 (Semantics of QB-Robustness). Let $\varphi$ be an STL formula as
defined in Definition 1, and $X$ be a sub-formula sequence w.r.t. $\varphi$. For $\varphi \equiv
\bigwedge \varphi_i \mid \bigvee \varphi_i$, let $\varphi_k \in \mathrm{ISForm}(\varphi)$ be the first element $X_{\mathrm{head}}$ of $X$; then we can
represent these two cases as $\varphi \equiv \varphi_k \wedge \varphi_{\bar{k}} \mid \varphi_k \vee \varphi_{\bar{k}}$, where $\varphi_{\bar{k}}$ is the conjunction
(or disjunction, respectively) of the other formulas in $\mathrm{ISForm}(\varphi) \setminus \{\varphi_k\}$. The
QB-Robustness $\mathrm{QBRob}(w,\varphi,X)$ of $\varphi$ w.r.t. $X$ is defined as follows:

\[
\begin{aligned}
\mathrm{QBRob}(w,\alpha,\varepsilon) &:= [\![w,\alpha]\!] \qquad\qquad
\mathrm{QBRob}(w,\bot,\varepsilon) := -\infty \\
\mathrm{QBRob}(w,\neg\varphi,X) &:= -\mathrm{QBRob}(w,\varphi,X) \\
\mathrm{QBRob}(w,\varphi_k \wedge \varphi_{\bar{k}},X) &:=
  \begin{cases}
    \mathrm{QBRob}(w,\varphi_k,X_{\mathrm{tail}}) & \text{if } w \models \varphi_{\bar{k}} \\
    -\infty & \text{otherwise}
  \end{cases} \\
\mathrm{QBRob}(w,\varphi_k \vee \varphi_{\bar{k}},X) &:=
  \begin{cases}
    \mathrm{QBRob}(w,\varphi_k,X_{\mathrm{tail}}) & \text{if } w \not\models \varphi_{\bar{k}} \\
    \infty & \text{otherwise}
  \end{cases} \\
\mathrm{QBRob}(w,\Box_I\varphi,X) &:= \textstyle\bigsqcap_{t \in I} \mathrm{QBRob}(w^{t},\varphi,X) \\
\mathrm{QBRob}(w,\Diamond_I\varphi,X) &:= \textstyle\bigsqcup_{t \in I} \mathrm{QBRob}(w^{t},\varphi,X)
\end{aligned}
\]
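The case analysis of Definition 7 can be sketched in a few lines of code. This is an illustrative sketch under several assumptions, not the ForeSee implementation: signals are lists of samples at unit time steps, intervals are sample offsets, formulas are nested tuples, and the names `window`, `sat`, and `qbrob` are hypothetical.

```python
import math

def window(w, t, interval):
    """Sample indices of the interval shifted to t, clipped to the signal length."""
    a, b = interval
    return range(t + a, min(t + b, len(w) - 1) + 1)

def sat(w, phi, t=0):
    """Boolean semantics on a sampled signal."""
    op = phi[0]
    if op == "atom":
        return phi[1](w[t]) > 0
    if op == "bot":
        return False
    if op == "not":
        return not sat(w, phi[1], t)
    if op == "and":
        return all(sat(w, p, t) for p in phi[1:])
    if op == "or":
        return any(sat(w, p, t) for p in phi[1:])
    if op == "always":
        return all(sat(w, phi[2], s) for s in window(w, t, phi[1]))
    if op == "eventually":
        return any(sat(w, phi[2], s) for s in window(w, t, phi[1]))
    raise ValueError(f"unsupported operator: {op}")

def qbrob(w, phi, seq, t=0):
    """QB-Robustness of phi w.r.t. the sub-formula sequence seq (a list of formulas)."""
    op = phi[0]
    if op == "atom":
        return phi[1](w[t])                       # quantitative robustness of the atom
    if op == "bot":
        return -math.inf
    if op == "not":
        return -qbrob(w, phi[1], seq, t)
    if op in ("and", "or"):
        head, tail = seq[0], seq[1:]
        others = [p for p in phi[1:] if p is not head]
        if op == "and":                           # others are only checked Boolean-wise
            ok = all(sat(w, p, t) for p in others)
            return qbrob(w, head, tail, t) if ok else -math.inf
        violated = not any(sat(w, p, t) for p in others)
        return qbrob(w, head, tail, t) if violated else math.inf
    if op in ("always", "eventually"):
        vals = [qbrob(w, phi[2], seq, s) for s in window(w, t, phi[1])]
        if op == "always":
            return min(vals, default=math.inf)
        return max(vals, default=-math.inf)
    raise ValueError(f"unsupported operator: {op}")

speed_ok = ("atom", lambda s: 130 - s["speed"])   # speed < 130
gear_ok = ("atom", lambda s: 5 - s["gear"])       # gear < 5
phi = ("always", (0, 2), ("and", gear_ok, speed_ok))
w = [{"speed": 100, "gear": 3}, {"speed": 120, "gear": 4}, {"speed": 125, "gear": 4}]

by_speed = qbrob(w, phi, [speed_ok])              # 5: follows the speed margin
by_gear = qbrob(w, phi, [gear_ok])                # 1: follows the gear margin
```

With the sequence that selects the speed atom, the value tracks how close the speed gets to 130, whereas the classical robustness of the same signal would be 1 because of the gear term.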


We now prove that the semantics of QB-Robustness is equivalent (in the 
sense of satisfaction) to the Boolean semantics, and so it can be used to show 
violation of a specification in a falsification algorithm, as we do in this paper. 


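Theorem 1. Let $\varphi$ be an STL formula. Given a signal $w$, for any $X \in \mathcal{X}_\varphi$, it
holds that $\mathrm{QBRob}(w,\varphi,X) > 0$ implies $w \models \varphi$. Similarly, for any $X \in \mathcal{X}_\varphi$, it
holds that $\mathrm{QBRob}(w,\varphi,X) < 0$ implies $w \not\models \varphi$.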


Proof. We first recall from [19, Prop. 16] that $[\![w,\varphi]\!] < 0$ implies $w \not\models \varphi$, and
that $[\![w,\varphi]\!] > 0$ implies $w \models \varphi$. We prove Theorem 1 by induction on the
structure of the formula.


— Case $\varphi \equiv \alpha$. By Definition 7, $\mathrm{QBRob}(w,\alpha,\varepsilon) > 0$ indicates that $[\![w,\alpha]\!] > 0$
and hence $w \models \alpha$, and $\mathrm{QBRob}(w,\alpha,\varepsilon) < 0$ that $[\![w,\alpha]\!] < 0$ and hence $w \not\models \alpha$.

— For the following cases, let us assume that Theorem 1 holds for an arbitrary
formula $\varphi'$ and its sub-formula sequence $X'$, i.e., that $\mathrm{QBRob}(w,\varphi',X') > 0$ implies
$[\![w,\varphi']\!] > 0$, and that $\mathrm{QBRob}(w,\varphi',X') < 0$ implies $[\![w,\varphi']\!] < 0$. We aim to
prove that Theorem 1 also holds for $\varphi$, resulting from the application of the
operator in each of the following cases to $\varphi'$, and $X$, the sub-formula sequence
of $\varphi$.

  • Case $\varphi \equiv \varphi' \wedge \psi$, where $\psi$ is an arbitrary formula. Let $X = \varphi' \cdot X'$, and
    let us consider the two cases in which $\mathrm{QBRob}(w,\varphi,X)$ is negative and
    positive separately:

    ∗ If $\mathrm{QBRob}(w,\varphi,X) < 0$, there are two sub-cases:
      - if $\mathrm{QBRob}(w,\varphi',X') < 0$, then $[\![w,\varphi']\!] < 0$ (by assumption).
        Then, by the robust semantics of conjunction, also $[\![w,\varphi]\!] < 0$
        holds, and so does $w \not\models \varphi$.
      - if $\mathrm{QBRob}(w,\varphi',X') > 0$, then $[\![w,\varphi']\!] > 0$ (by assumption).
        Then, it holds $w \not\models \psi$ by Definition 7, and, therefore, it holds
        $w \not\models \varphi$.
    ∗ If $\mathrm{QBRob}(w,\varphi,X) > 0$, it means that $\mathrm{QBRob}(w,\varphi',X') > 0$ and
      $w \models \psi$ (by Definition 7). By assumption, if $\mathrm{QBRob}(w,\varphi',X') > 0$,
      then $[\![w,\varphi']\!] > 0$. Therefore, $w \models \varphi$.
  • Case $\varphi \equiv \Box_I\varphi'$. Let $X = X'$, and let us consider the two cases in which
    $\mathrm{QBRob}(w,\varphi,X)$ is negative and positive separately:
    ∗ By Definition 7, $\mathrm{QBRob}(w,\varphi,X) < 0$ indicates that there exists a
      $t \in I$ such that $\mathrm{QBRob}(w^{t},\varphi',X) < 0$. By assumption, it holds that
      $w^{t} \not\models \varphi'$. Then, by the semantics of the always operator $\Box$, it holds
      that $w \not\models \varphi$.
    ∗ By Definition 7, $\mathrm{QBRob}(w,\varphi,X) > 0$ indicates that for all $t \in I$ it
      holds that $\mathrm{QBRob}(w^{t},\varphi',X) > 0$. Then, by assumption, it holds that
      for all $t \in I$, $w^{t} \models \varphi'$. So, by the semantics of the always operator $\Box$,
      it holds that $w \models \varphi$.
  • Case $\varphi \equiv \neg\varphi'$. Let $X = X'$, and let us consider the two cases in which
    $\mathrm{QBRob}(w,\varphi,X)$ is negative and positive separately:
    ∗ By Definition 7, $\mathrm{QBRob}(w,\varphi,X') < 0$ indicates that
      $\mathrm{QBRob}(w,\varphi',X') > 0$. By assumption, it holds that $w \models \varphi'$, and,
      therefore, $w \not\models \varphi$.
    ∗ By Definition 7, $\mathrm{QBRob}(w,\varphi,X') > 0$ indicates that
      $\mathrm{QBRob}(w,\varphi',X') < 0$. By assumption, it holds that $w \not\models \varphi'$, and,
      therefore, $w \models \varphi$.
  • Proofs for the cases of $\varphi \equiv \varphi' \vee \psi$ and $\varphi \equiv \Diamond_I\varphi'$ follow similar proof
    patterns, and so are left to the readers.


We use an example to illustrate how QB-Robustness is used for checking the 
satisfiability of an STL formula. 


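Example 2. Let $w\colon [0,T] \to \mathbb{R}^{2}$ be a 2-dimensional signal and $\varphi \equiv \Box_I(\varphi_1 \vee \varphi_2)$
be an STL formula where $\varphi_1$ and $\varphi_2$ are two atomic formulas. Intuitively, to
make $\varphi$ falsified, there must exist $t \in I$ such that $w^{t} \not\models \varphi_1$ and $w^{t} \not\models \varphi_2$. Let us
consider a non-trivial falsification problem in which, for most of the signals $w$, the
sets $\{t \in I \mid w^{t} \not\models \varphi_1\}$ and $\{t \in I \mid w^{t} \not\models \varphi_2\}$ are non-empty and disjoint.

By Definition 7, given the sub-formula sequence $X = \varphi_1$ of $\varphi$, the corresponding
QB-Robustness is $\mathrm{QBRob}(w,\varphi,X) = \bigsqcap_{t \in I} \mathrm{QBRob}(w^{t},\varphi_1 \vee \varphi_2,\varphi_1)$, i.e., it
takes the infimum of $\mathrm{QBRob}(w^{t},\varphi_1 \vee \varphi_2,\varphi_1)$ over $t \in I$. Again, by Definition 7,
for any $t' \in I$, $\mathrm{QBRob}(w^{t'},\varphi_1 \vee \varphi_2,\varphi_1)$ is computed as follows:


— if for a $t' \in I$ it holds $w^{t'} \models \varphi_2$, then $\mathrm{QBRob}(w^{t'},\varphi_1 \vee \varphi_2,\varphi_1) = \infty$. Then, it
is impossible that $\bigsqcap_{t \in I} \mathrm{QBRob}(w^{t},\varphi_1 \vee \varphi_2,\varphi_1)$ is given by
$\mathrm{QBRob}(w^{t'},\varphi_1 \vee \varphi_2,\varphi_1)$;

— if for a $t' \in I$ it holds $w^{t'} \not\models \varphi_2$, then $\mathrm{QBRob}(w^{t'},\varphi_1 \vee \varphi_2,\varphi_1) =
\mathrm{QBRob}(w^{t'},\varphi_1,\varepsilon) = [\![w^{t'},\varphi_1]\!]$. In this case, $\mathrm{QBRob}(w^{t'},\varphi_1 \vee \varphi_2,\varphi_1)$ has a
chance to determine the value of $\bigsqcap_{t \in I} \mathrm{QBRob}(w^{t},\varphi_1 \vee \varphi_2,\varphi_1)$.

Therefore, when $X = \varphi_1$, it holds that $\mathrm{QBRob}(w,\varphi,X) = \bigsqcap_{t \in S} [\![w^{t},\varphi_1]\!]$, where
$S = \{t \in I \mid w^{t} \not\models \varphi_2\}$, i.e., the infimum of the quantitative robustness of $\varphi_1$
over the time points at which $\varphi_2$ is violated. Indeed, once this value is negative, it means
that there exists a point $t \in I$ at which both $\varphi_1$ and $\varphi_2$ are violated; by the Boolean
semantics of always and disjunction, $\varphi$ is violated.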


4 MCTS-Based Falsification Guided by QB-Robustness 


QB-Robustness never compares robustness values coming from signals with dif- 
ferent magnitudes, and, therefore, it does not suffer from the scale problem. As 
such, it could be used in falsification approaches instead of the classical pure 
quantitative robustness. 

However, a sub-formula sequence $X$ is required when calculating QB-Robustness,
and such a sequence is not unique (see Definition 7). Note that the
selection of the sequence can affect the performance of the numerical optimization
algorithms used in falsification. Let us consider $\varphi \equiv \Box((\mathit{gear} < 6) \wedge (\mathit{speed} <
130))$ as an example. As explained in Sect. 1, numerical optimization will perform
better if guided by the robustness values coming from speed rather than by those
coming from gear. Therefore, in a falsification approach using QB-Robustness,
it is important to select an appropriate sub-formula sequence $X$.

By using QB-Robustness, the problem of falsifying an STL formula $\varphi$
consists in finding both a signal $w$ and a sub-formula sequence $X$ such that
$\mathrm{QBRob}(w,\varphi,X) < 0$. The selection of $X$ is discrete, while the search for $w$ is
numerical. In order to combine these processes that are different in nature, we
propose to adapt Monte Carlo Tree Search (MCTS) [8,28]. In the following, we
first give a brief introduction to MCTS in Subsect. 4.1, and then present the
application of MCTS to our falsification problem in Subsect. 4.2.


4.1 MCTS Background 


MCTS exemplifies the “trial and error” philosophy, and has achieved great
success over the past decade, most notably in fields such as the computer Go
game [35]. MCTS explores the action space given by the possible actions of
the system; for example, in the Go game, these are the positions where the next
stone can be placed. The approach builds a tree of sequences of actions, and assigns
rewards to the different branches. MCTS performs the search by iteratively taking
the following four steps. See Fig. 1, where the general scheme is adapted to
our current setting, for illustration.


— Selection. It selects a node to expand or to reason about. Initially, selection has 
no other choice than the root. When a node has multiple expanded children, 
selection will be done according to the UCB1 [6] algorithm. 

— Expansion. Child expansion happens after selection if the selected node has 
unexpanded children. A child will be added to the tree during expansion. 

— Playout. After a node has just been expanded or a leaf is reached, playout is
performed to evaluate the node. The evaluation is given by a reward, which
is a real number in [0,1]. Reward can be interpreted differently in different
contexts. For example, in the Go game, the reward of a position is measured
by the winning rate when a stone is positioned there; this is estimated by
randomly playing the game until the end $n$ times, and then taking the
ratio of the number of wins to $n$ as the winning rate.

— Backpropagation. Backpropagation updates the number of visits and the
reward of the nodes along the path from the node of playout to the root.
These data are used in subsequent loops to decide the branches to explore.


At the end, the action space will be sufficiently explored in an unbalanced manner,
by focusing on the most promising sub-spaces giving the highest rewards.


[Figure 1: search-tree snapshots for the phases Initialization, Expansion, Playout,
Backpropagation, Selection, and Termination. Selection is done by the UCB1 algorithm,
playout runs hill climbing guided by QB-Robustness, and the search terminates when a
falsifying input is found.]

Fig. 1. The workflow of MCTS-based falsification guided by QB-Robustness. Let us
consider the falsification of an STL formula $\varphi \equiv \Box_I(\varphi_1 \vee \varphi_2)$, where $\varphi_1 \equiv \varphi_{11} \wedge \varphi_{12}$.
Initially, there is only the root in the tree, so the algorithm selects it for expansion.
Then, the algorithm keeps on randomly selecting a child of a non-fully expanded node,
until a leaf node is reached. By reaching a leaf, a sub-formula sequence $X$ has been
constructed; the algorithm performs playout using $X$, by running hill-climbing
optimization guided by the QB-Robustness with $X$, to estimate the reward of the path.
After playout, the algorithm backpropagates the reward and the number of visits from
the leaf to the root. When all the children of a node are expanded, selection is done
based on the UCB1 algorithm. After many loops, the algorithm has explored all the
possible sub-formula sequences in $\mathcal{X}_\varphi$, and it starts allocating more resources to those
branches where hill-climbing optimization progresses more smoothly. The algorithm
terminates either when a falsifying input is found, or when the budget is exhausted.


4.2 Proposed QB-Robustness-Guided Falsification Approach 


We here propose a falsification framework based on MCTS in which, during tree 
construction, we synthesize and select a sub-formula sequence that facilitates 
the falsification progress the most, and, at the bottom layer of the tree, we run 
numerical optimization to search for a falsifying input and provide feedback (i.e., 
backpropagation) to guide the sequence selection. 

We formalize our algorithm in Algorithm 2 and visualize its execution in
Fig. 1. In the following, we elaborate on our approach.

Algorithm 2. MCTS-based falsification guided by QB-Robustness
Require: a system model $\mathcal{M}$, an STL formula $\varphi$, and the following tunable parameters:
a scalar $c$ for UCB1, an MCTS budget $B_m$, and a playout budget $B_p$.

 1: function MCTS
 2:     $X^{\mathrm{init}} \gets \varphi$                          ▷ the root denoted as a sequence with $\varphi$ only
 3:     $\mathcal{T} \gets \{X^{\mathrm{init}}\}$                  ▷ the MCTS search tree, initially root only
 4:     $N \gets (X^{\mathrm{init}} \mapsto 0)$                    ▷ visit count $N$ initialized, defined only for root
 5:     $R \gets (X^{\mathrm{init}} \mapsto 0)$                    ▷ reward function $R$ initialized
 6:     $H \gets (X^{\mathrm{init}} \mapsto \emptyset)$            ▷ the sampling history of hill climbing
 7:     while $\varphi$ not falsified and within the MCTS budget $B_m$ do
 8:         MCTSSEARCH($X^{\mathrm{init}}$)

 9: function MCTSSEARCH($X$)
10:     if $\mathrm{ISForm}(X_{\mathrm{rear}}) \neq \emptyset$ then                 ▷ the node has children
11:         if $X \cdot \varphi_k \in \mathcal{T}$ for all $\varphi_k \in \mathrm{ISForm}(X_{\mathrm{rear}})$ then    ▷ all children expanded
12:             $\varphi_k \gets \arg\max_{\varphi_i \in \mathrm{ISForm}(X_{\mathrm{rear}})} \Big( R(X \cdot \varphi_i) + c\sqrt{\frac{2 \ln N(X)}{N(X \cdot \varphi_i)}} \Big)$    ▷ selection by UCB1
13:         else                                                    ▷ unexpanded children exist
14:             randomly select $\varphi_k$ from $\{\varphi_k \in \mathrm{ISForm}(X_{\mathrm{rear}}) \mid X \cdot \varphi_k \notin \mathcal{T}\}$
15:             $\mathcal{T} \gets \mathcal{T} \cup \{X \cdot \varphi_k\}$          ▷ expand a new child
16:             $N(X \cdot \varphi_k) \gets 0$
17:             $H(X \cdot \varphi_k) \gets \emptyset$
18:         MCTSSEARCH($X \cdot \varphi_k$)                         ▷ recursive call
19:         $R(X) \gets \max_{\varphi_k \in \mathrm{ISForm}(X_{\mathrm{rear}})} R(X \cdot \varphi_k)$    ▷ back propagation for reward
20:     else                                                        ▷ a leaf node reached
21:         while within playout budget $B_p$ do                    ▷ playout by hill-climbing falsification
22:             $u \gets$ HILL-CLIMB($H(X)$)                        ▷ hill-climbing
23:             $rb \gets \mathrm{QBRob}(\mathcal{M}(u), \varphi, X_{\mathrm{tail}})$
24:             if $rb < 0$ then                                    ▷ falsifying input found
25:                 return $(u, rb)$
26:             $H(X) \gets H(X) \cup \{(u, rb)\}$                  ▷ record sampling history
27:         $R(X) \gets \mathrm{Rwd}(rb, H(X))$
28:     $N(X) \gets N(X) + 1$                                       ▷ back propagation for visit count


We construct the tree in this way: each node represents a sequence of formulas,
and each edge of a node is a sub-formula of the last element of the sequence
represented by the node. The root is initialized with a sequence holding $\varphi$ only
(Lines 2-3) and some other properties including the number of visits to the different
nodes (Line 4), the reward (Line 5), and the history of hill-climbing sampling
(Line 6). The main process of MCTS consists in calling the MCTSSEARCH function
iteratively with the root as argument (Line 8), until the MCTS budget is exhausted
or a falsifying input is found (Line 7). The MCTSSEARCH function
(Line 9) goes through the four phases, namely selection, expansion, playout
and backpropagation, of the original MCTS algorithm.

Selection. Selection happens when a node has children (Line 10) and these
have all been expanded (Line 11). It selects a child according to the UCB1 [6]
algorithm (Line 12) to strike a balance between exploration and exploitation. The
exploitation is embodied by the reward $R(X \cdot \varphi_i)$: the higher the reward is,
the more likely a falsifying input is found following that branch. Exploration,
instead, is considered via $\sqrt{2 \ln N(X) / N(X \cdot \varphi_i)}$, which is negatively correlated to the number
of visits to a child: the more the child was visited before, the less chance it will
be visited again. The scalar $c$ is a tunable parameter that balances the trade-off
between exploration and exploitation. After a child $X \cdot \varphi_k$ is selected, it will be
taken as the argument of the next MCTSSEARCH call (Line 18).
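The UCB1 selection rule of Line 12 can be written as a short function. The sketch below is illustrative: the function name `ucb1_select`, the dictionary-based bookkeeping, and the numeric values are assumptions, and it presumes that every child has been visited at least once, which holds in Algorithm 2 because selection only happens after all children have been expanded.

```python
import math

def ucb1_select(children, reward, visits, parent_visits, c):
    """Return the child maximising R(child) + c * sqrt(2 ln N(parent) / N(child))."""
    def score(child):
        exploration = math.sqrt(2.0 * math.log(parent_visits) / visits[child])
        return reward[child] + c * exploration
    return max(children, key=score)

# Hypothetical statistics: "a" looks better so far, but "b" has barely been tried.
reward = {"a": 0.6, "b": 0.5}
visits = {"a": 10, "b": 1}

greedy = ucb1_select(["a", "b"], reward, visits, parent_visits=11, c=0.0)
balanced = ucb1_select(["a", "b"], reward, visits, parent_visits=11, c=0.2)
```

With `c = 0` the rule is purely exploitative and picks the child with the higher reward; a positive `c` shifts the choice towards the rarely visited child.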


Expansion. If not all the children of a node have been expanded (Line 13), a
child will be expanded. Expansion consists in randomly selecting a child from
the unexpanded child list (Line 14), adding it to the tree (Line 15), and initializing
its properties, including the number of visits and history (Lines 16-17). After expansion,
the newly expanded child will be taken as the argument of the recursive
call to MCTSSEARCH (Line 18).


Playout. If a leaf node that has no children to expand is reached, the playout
phase will start to devise a reward for evaluating the visited path. In our context,
we define the reward based on the best robustness value that can be obtained
with the path; specifically, playout consists in running hill-climbing guided falsification
to search for a minimal robustness value (Line 22). Note that the sequence
$X$ represented by a leaf node is actually the concatenation of $\varphi$ and a
sub-formula sequence of $\varphi$. We extract the suffix of $X$, i.e., the sub-formula sequence,
to compute the QB-Robustness as a guidance to the hill-climbing optimization
(Line 23). If a negative QB-Robustness is found (Line 24), then the whole algorithm
can be terminated and the input signal $u$ that triggers the negative QB-Robustness
can be returned as the falsifying input (Line 25); otherwise, the
sampling history of hill climbing will be saved (Line 26) so that a future playout
at the same leaf can be resumed from that context. After playout, the reward
of the leaf node will be updated based on the definition of the reward, which is
introduced below.

Reward. Since our goal is to find a sequence $X$ with which
hill-climbing optimization can minimize $\mathrm{QBRob}(w,\varphi,X)$ smoothly, we connect
the reward with the hill-climbing progress. Formally, given a sampling history
$H$, our reward (Line 27) is defined as $\mathrm{Rwd}(rb', H) := [\ldots]$,
where $rb_H$ is the history of robustness values in $H$.


Backpropagation. In MCTS, the playout result of a leaf is backpropagated to
the higher-layer nodes along the path, so that future selections at the higher
layers can refer to it. Backpropagation updates two properties of each ancestor of the
leaf up to the root: the reward (Line 19) and the number of visits (Line 28).
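The four phases can be put together in a compact sketch of the search loop. It abstracts away the formula-specific parts: `children` and `playout` are caller-supplied functions, nodes are hashable tuples standing for sequences, and the rewards returned by the toy playout are made-up numbers. None of these names or values come from the ForeSee implementation.

```python
import math
import random

def mcts_falsify(root, children, playout, c=0.2, mcts_budget=50, seed=0):
    """Sketch of the MCTS loop: `children(node)` lists child nodes (empty for a leaf);
    `playout(leaf)` returns (falsifying input or None, reward in [0, 1])."""
    rng = random.Random(seed)
    tree = {root}
    N = {root: 0}            # visit counts
    R = {root: 0.0}          # rewards
    found = []

    def search(node):
        kids = children(node)
        if kids:
            unexpanded = [k for k in kids if k not in tree]
            if not unexpanded:                            # selection by UCB1
                child = max(kids, key=lambda k: R[k]
                            + c * math.sqrt(2 * math.log(N[node]) / N[k]))
            else:                                         # expansion
                child = rng.choice(unexpanded)
                tree.add(child)
                N[child] = 0
                R[child] = 0.0
            search(child)                                 # recursive call
            R[node] = max(R[k] for k in kids if k in tree)   # reward backpropagation
        else:                                             # playout at a leaf
            u, reward = playout(node)
            if u is not None:
                found.append(u)
            R[node] = reward
        N[node] += 1                                      # visit-count backpropagation

    for _ in range(mcts_budget):
        if found:
            break
        search(root)
    return (found[0] if found else None), N

# Toy setting: two leaves; only the "speed" leaf ever yields a falsifying input.
ROOT, GEAR, SPEED = ("phi",), ("phi", "gear"), ("phi", "speed")
calls = {GEAR: 0, SPEED: 0}

def toy_children(node):
    return [GEAR, SPEED] if node == ROOT else []

def toy_playout(leaf):
    calls[leaf] += 1
    if leaf == SPEED:
        return ("u*", 1.0) if calls[leaf] >= 3 else (None, 0.5)
    return (None, 0.0)

falsifier, visit_counts = mcts_falsify(ROOT, toy_children, toy_playout)
```

In the toy run, both leaves are expanded once, after which UCB1 keeps returning to the leaf whose playouts report progress, so the falsifying input is found on its third visit.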


Remark 1 (Approach Complexity). With respect to classical falsification,
our approach introduces an exploration phase for searching the “best” sub-
formula sequence to instantiate QB-Robustness. The number of these sequences
corresponds to the number of atomic sub-formulas (and so the leaves of the

Table 1. Benchmarks — STL specifications


Spec. ID Temporal specification in STL Spec. ID Temporal specification in STL 

AT1 Oio,30] (gear = 4 > speed > 35) AT8 Ojo,10] (speed < 50) V Ofo,30) (rpm > 2520) 

AT2 [0,30] (gear = 4 > Qjo,5] (rpm < 4300)) AT9 hg 30] (speed < 50 V speed > 60 V rpm < 1000) 

AT3 Olp,30) (speed < 130 A gear < 5) AT10 Olp,30) (gear = 4 — (speed > 35 A Ojo,5) (rpm < 4000))) 
AT4 Ojo,30] (speed < 135 A < 4780) AT11 Ofp,30] (fo,8] (gear = 1 — (speed < 20 A rpm < 600))) 
AT5 Ojo,30] (rpm < 600 — Ofo,10) (gear > 1)) AT12 Ojo,30] (gear < 3) V Olo,30] (speed < 135 A rpm < 4780) 
AT6 Ofo,30) (f0,5](speed < 120 V rpm > 3500)) AT13 Ojo,30] ((gear = 4 > Qjo,5] (rpm < 4000)) A gear < 5) 
AT7 [0,30] (rpm < 4750 A gear < 5) AT14 Olp,30] (throttle = 0 V brake = 0) — Op,30] (speed < 110) 


Spec. ID Temporal specification in STL 


AT15 — Oo,30)((rpm < 4770 V Op,1) (rpm > 1000)) A {0,5 (gear < 5)) 
AT16 Oho,30) (gear = 4 > ((Of0,5] (rpm < 3000) A (gear = 2 — speed < 20)))) 
AT17 Oho,5] (speed < 70 A gear < 4) \Oh10,20)(rpm < 4780) A Oj25,30 (speed < 130) 


AT18 Ojo,30] ((gear = 4 > Qjo,5] (rpm < 4250)) A (gear = 3 > Qjo,5] (rpm < 4700)) A (gear = 2 > Qjo,5] (rpm < 4800))) 


AT19 Ofo,30) ((gear = 1 > speed < 80) A (gear = 2 — speed < 90) A (gear = 3 — speed > 20) A (gear = 4 — speed > 30)) 
AT20 Oho,29) (speed < 100) V Oj29,30) (speed > 64) A Ofo,30) (rpm < 4770 V Ojo,ı} (rpm > 700)) 

AT21 Ofo,30) (throttle > 90 > Qjo,10] (throttle < 30)) > Ojo,30] (gear = 4 > Qto,s) (rpm < 4000)) 

AT22 Ojo,30] (throttle > 70 > Qjo,10, (brake > 50)) > Op,30 (gear = 4 > speed > 35) 


Spec. ID Temporal specification in STL 


AFC1 e (mode = 1 > pu < 0.228) 

AFC2 9j0,50) (PedalAngle > 40) > Ofi,s0) (# < 0.225) 

AFC3 9J0,50) (EngineSpeed > 1000) — Ofi1,50) (# < 0.225) 

AFC4 Ofo,50) (EngineSpeed > 910 V PedalAngle > 25) — Op1,50] (# < 0.225) 

AFC5 9Jo,50) (PedalAngle > 40) > Ofr1,50) (Oj0,8] (4 < 0.06)) 

AFC6 jo,50) (PedalAngle > 40 A EngineSpeed > 1000) > Op1,50] (Oj0,8] (e < 0.06)) 


Spec. ID Temporal specification in STL 


FFR1 Olo,s)((ul, u3 > OV ul, u3 < 0) A (u2, u4 > 0 V u2, ud < 0)) > Olo,5)(A(@1 > 3.9 A x1 < 4.1) A ~(z3 > 3.9 A x3 < 4.1)) 
FFR2  ~(0o,s(Op.(£1 > 1.5 A al < 1.7 A 23 > 1.5 A £3 < 1.7))) 


tree). Considering that most of the time is spent on playout, the complexity of 
our approach grows linearly with the number of atomic sub-formulas. 


5 Experimental Evaluation 


In this section, we present the experiments we conducted to evaluate the
effectiveness of the proposed approach. We first introduce the experiment setup
in Subsect. 5.1, and then we present the experimental evaluation results by
answering three research questions in Subsect. 5.2.


5.1 Experiment Setup 


Simulink Models and Specifications. As our benchmarks, we selected three
Simulink models frequently used in the falsification community (i.e., in the
falsification competitions [18]), and 30 specifications defined for them. All these
models are complicated hybrid systems with multiple input and output signals.
The specifications are STL formulas that formalize system requirements regarding
safety, performance, etc. Since we are interested in assessing the influence of
the scale problem on the performance of the compared falsification approaches,
all the considered specifications predicate over at least two signals. Table 1
reports the 30 specifications under test. The IDs of the specifications identify
which models they belong to. A description of the three models and of their
specifications is as follows.


— Automatic Transmission (AT) [24] has two input signals, $\mathit{throttle} \in [0,100]$
and $\mathit{brake} \in [0,325]$, and three output signals, namely gear, speed and
rpm. Most of the specifications we used formalize safety requirements of the
system. For instance, AT2 requires that when gear is as high as 4, rpm
should not be larger than 4300; AT3 is an adaptation of the example we used
in Sect. 1; AT10-12 reason about the relationship among the three output
signals; AT17 specifies three properties for three different time intervals; AT18
specifies different properties for different values of gear; AT14, AT21 and
AT22 impose logical constraints on input signals, in addition to the property
under consideration.

— Abstract Fuel Control (AFC) [25] takes two input signals, $\mathit{PedalAngle} \in
[8.8, 70]$ and $\mathit{EngineSpeed} \in [900,1100]$, and outputs a ratio $\mu$ reflecting the
deviation of the air-fuel ratio from its reference value. The basic safety requirement
for this system is that $\mu$ should not deviate from the reference value
too much (AFC1); in addition to that, our specifications also reason about
the resilience of the system (AFC5 and AFC6), and impose input constraints
(AFC2-6).

— Free Floating Robot (FFR) [11] models a robot moving in a 2-dimensional space.
It has four input signals $u_1, u_2, u_3, u_4 \in [-10, 10]$ that are boosters for a robot,
and four output signals that are the position in terms of coordinate values
$x, y$ and their first-order derivatives $\dot{x}, \dot{y}$. The specifications regulate the kinetic
properties of the robot: FFR1 requires the robot to pass an area around the
point (4,4) under an input constraint, and FFR2 requires the robot to stay
in an area for at least 2 s.


Baseline Approach and Our Proposed Approach. In our experiments, we compare 
the performances of our proposed approach with the baseline Breach approach. 
We implemented our approach in the tool ForeSee, which stands for FORmula 
Exploitation by Sequence trEE for falsification. 

Breach is a state-of-the-art falsification tool that implements the classic falsification
workflow we introduced in Sect. 2. The quantitative robustness calculation
in Breach is based on the robust semantics given in Definition 2. Breach
also encapsulates several stochastic optimization algorithms, such as CMA-ES, 
Simulated Annealing, etc. The implementation of our ForeSee approach uses 
Breach only for interfacing with the Simulink model and for the calculation 
of quantitative robustness; instead, the calculation of QB-Robustness, and the 
implementation of the MCTS algorithm are novel. Since CMA-ES has proved 
to be the state-of-the-art stochastic algorithm [39], we select CMA-ES as our 
backend optimizer for the playout phase.³


3 ForeSee is available at https://github.com/choshina/ForeSee.

We apply the two approaches, ForeSee and Breach, to each benchmark specification
reported in Table 1. Since both approaches are based on stochastic optimization,
we repeat each experiment 30 times, as suggested by a guideline for
conducting experiments with randomized algorithms [5]. For each experiment,
both approaches have been given a total timeout Bm of 900 s (see Algorithm 2).


Evaluation Metrics. As the first evaluation metric, we compute the falsification rate
(FR) as the number of runs (out of 30) in which the approach returns a falsifying
input. FR is thus an indicator of the effectiveness of an approach, i.e., it
reflects the ability of an algorithm to falsify the specification. As the second
evaluation metric, we compute the average time (in seconds), i.e., the average
execution time of the successful falsification runs. The average time is thus an
indicator of the efficiency of the approach. We do not report the number of
simulations because it is consistent with the execution time.
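The two metrics can be sketched in a few lines of Python. This is only an illustration under an assumed data layout: each run is recorded as its execution time in seconds if it found a falsifying input, and as `None` otherwise. The function name `falsification_metrics` is hypothetical and not part of ForeSee or Breach.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def falsification_metrics(
    run_times: Sequence[Optional[float]],
) -> Tuple[int, Optional[float]]:
    """Return (FR, average time) for one benchmark.

    run_times[i] is the execution time in seconds of run i if that run
    returned a falsifying input, and None if it timed out (assumed layout).
    """
    successes = [t for t in run_times if t is not None]
    fr = len(successes)  # number of successful runs
    # The average is taken over successful runs only; it is undefined
    # (None here, "-" in the tables) when no run succeeded.
    avg_time = mean(successes) if successes else None
    return fr, avg_time


# Made-up example with five runs, two of which timed out.
fr, avg_time = falsification_metrics([12.0, None, 30.0, None, 18.0])
print(fr, avg_time)
```

Averaging over successful runs only is why a benchmark with FR equal to 0 has no time entry.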


Experiment Platform. In our experiments, we use Breach [13] (ver 1.2.13) with 
CMA-ES (the state of the art). Breach accepts piece-wise constant signals as 
input for the Simulink models; we use the same settings used in falsification 
competitions [18]: we use piece-wise constant signals with five control points for 
AT and AFC, and with four control points for FFR. As configuration of MCTS 
(see Algorithm 2), we set the UCB1 scalar c to 0.2, and the playout budget
Bp to 10 generations. The experiments have been executed on an Amazon EC2
c4.2xlarge instance (2.9 GHz Intel Xeon E5-2666 v3, 15 GB RAM).
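To make the input parameterization concrete, the following sketch builds a piece-wise constant signal from a list of control points. It is an illustration only and does not use the Breach API: the equal-length segments, the 30 s horizon, and the name `piecewise_constant` are assumptions made here.

```python
from typing import Callable, Sequence


def piecewise_constant(
    control_points: Sequence[float], horizon: float
) -> Callable[[float], float]:
    """Build a signal on [0, horizon] that holds each control point
    for one of len(control_points) equal-length segments (assumption)."""
    n = len(control_points)
    if n == 0 or horizon <= 0:
        raise ValueError("need at least one control point and a positive horizon")
    segment = horizon / n

    def signal(t: float) -> float:
        if not 0 <= t <= horizon:
            raise ValueError("time outside the signal horizon")
        # The last segment is closed on the right, hence the min().
        index = min(int(t // segment), n - 1)
        return control_points[index]

    return signal


# Five control points, as used for AT and AFC; values and horizon are made up.
throttle = piecewise_constant([10.0, 80.0, 100.0, 40.0, 0.0], horizon=30.0)
print(throttle(0.0), throttle(6.0), throttle(30.0))
```

With this encoding the optimizer searches over a vector with one entry per control point and input signal, rather than over arbitrary functions of time.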


5.2 Evaluation 
We here analyze the experimental results using three research questions (RQs). 


RQ1. Does the proposed approach perform better than state-of-the-art falsifica- 
tion approaches? 

In this RQ, we aim at assessing whether the proposed approach is indeed able
to tackle the scale problem in falsification and performs better than
state-of-the-art approaches. Table 2 reports, for each benchmark specification,
the falsification rate FR and the average execution time of our proposed approach
ForeSee and of the baseline Breach. The table further reports the difference of the two
metrics between the two approaches. We highlight in gray the best results, i.e., those in
which the FR of ForeSee is at least 15 units higher than that of Breach. We observe that
for 25 benchmarks out of 30, ForeSee has a better FR, and in 15 of these the
improvement is significant (highlighted in gray). Note that there are notable cases,
such as AT3, AT13, AT16, and AT17, in which Breach finds at most two
falsifying inputs, while ForeSee always finds at least 29 falsifying inputs. In four
cases, Breach has a better FR: while for AT8, AFC6, and FFR2 the difference
is minimal, it is quite large for AT14. We further inspected this specification
and its corresponding model (see Table 1); we noticed that all the sub-formulas
in AT14 must be falsified to falsify the whole specification⁴, and they are all


4 Note that all binary connectives of AT14 are disjunctions; indeed, $A \rightarrow B$ is
syntactic sugar for $\neg A \lor B$.

Table 2. Falsification performance comparison between Breach and ForeSee on
benchmarks. Timeout: 900 s. FR in (/30), time in secs.

        Breach         ForeSee
Spec    FR    time     FR    time     ΔFR    Δtime

AT1     12    67.0     29    90.3     +17    +23.3
AT2     18   208.5     30   155.5     +12    -53.0
AT3      0       -     29    87.3     +29        -
AT4      8   414.0     30   376.6     +12    -37.4
AT5     13    44.7     30   159.0     +17   +114.3
AT6     14   630.5     20   545.9      +6    -84.6
AT7     20    24.9     30     5.8     +10    -19.1
AT8     17   418.5     13   547.0      -4   +128.5
AT9      9   298.6     29   208.0     +20    -90.6
AT10    14    99.4     30    89.7     +16     -9.7
AT11    17    58.1     30    39.6     +13    -18.5
AT12     5   379.2     28   381.4     +23     +2.2
AT13     2    75.2     29    98.3     +27    +23.1
AT14    24   184.9      1   601.5     -23   +416.6
AT15     1    66.1      9   331.8      +8   +265.7
AT16     1    13.0     30     6.7     +29     -6.3
AT17     0       -     30   208.8     +30        -
AT18    18   160.0     24   234.3      +6    +74.3
AT19    15    81.8     30   154.3     +15    +72.5
AT20     1    97.7      5   286.2      +4   +188.5
AT21    10   239.0     29   425.5     +19   +186.5
AT22    13    72.0     30   113.3     +17    +41.3

AFC1    10   532.2     12   458.0      +2    -74.2
AFC2    12   546.9     30   218.3     +18   -328.6
AFC3     8   727.6     28   232.5     +20   -495.1
AFC4     7   634.5     22   500.3     +15   -134.2
AFC5     8   576.9      9   322.0      +1   -254.9
AFC6    10   518.2      6   344.2      -4   -174.0

FFR1     7   132.1      7   399.3      +0   +267.2
FFR2    30    38.0     27   348.0      -3   +310.0


difficult to falsify. In such a case, there is no best sub-formula sequence X;
therefore, the time spent by ForeSee in exploring different X does not provide
any improvement.

Regarding execution time, there is no clear trend among the different
results: sometimes ForeSee is faster, other times Breach is. However, even in
the cases in which ForeSee is slower, it still manages to find a falsifying input
before the timeout (thus leading to better falsification rates).


RQ2. Does the proposed approach solve the scale problem effectively? 

The benchmarks reported in Table 1 and used in RQ1 predicate
over signals having different scales, and so they suffer from the scale problem.
RQ1 showed that ForeSee is very effective in falsifying them. In this RQ, we want
to make a more systematic study of the effects of the scale problem; indeed, the
scale problem could manifest itself in different ways, depending on the difference
in the orders of magnitude of the different signals (e.g., speed [km/h] vs. rpm,
or speed [km/h] vs. rph). To assess this, we take six specifications from Table 1
and we artificially modify their outputs: namely, we multiply by $10^k$ (with
different k values depending on the specification) the speed of AT1, AT3, AT4,

Table 3. Falsification performance under different scales. Each rescaled signal is
rescaled by $10^k$.

(a) AT1: speed $\times 10^k$

        Breach         ForeSee
k       FR    time     FR    time
-2      30   126.5     26    77.5
-1      25    64.4     29   107.9
0       12    67.0     29    90.3
1        9    92.4     28    81.8
2        9   131.9     30    94.2
min      9    64.4     26    77.5
max     30   131.9     30   107.9
mean    17    96.4     28    90.3

(b) AT3: speed $\times 10^k$

        Breach         ForeSee
k       FR    time     FR    time
-3      30   124.9     30    81.2
-2      30   135.9     28    82.6
-1       1   136.7     28   101.6
0        0       -     29    87.3
1        0       -     30   103.4
min      0   124.9     28    81.2
max     30   136.7     30   103.4
mean    12   132.5     29    91.2

(c) AT4: speed $\times 10^k$

        Breach         ForeSee
k       FR    time     FR    time
-2      29   247.2     29   329.4
-1      29   243.5     28   332.2
0        8   414.0     30   376.6
1        0       -     29   377.6
2        0       -     29   333.2
min      0   243.5     28   329.4
max     29   414.0     30   377.6
mean    13   301.6     29   349.8

(d) AT9: speed $\times 10^k$

        Breach         ForeSee
k       FR    time     FR    time
-1      11   202.6     28   259.8
0        9   298.6     29   208.0
1       10   197.4     29   221.2
2       28   175.4     29   248.9
3       30   162.6     29   209.6
min      9   162.6     28   208.0
max     30   298.6     29   259.8
mean    18   207.3     29   229.5

(e) AT15: rpm $\times 10^k$

        Breach         ForeSee
k       FR    time     FR    time
-5      20   138.3      6   222.3
-4      13   158.1     10   258.8
-3       4   144.6      5   313.6
-2       0       -      9   268.6
0        1    66.1      9   331.8
min      0    66.1      5   222.3
max     20   158.1     10   331.8
mean    10   126.8      8   279.0

(f) AFC3: EngineSpeed $\times 10^k$

        Breach         ForeSee
k       FR    time     FR    time
0        8   727.6     28   232.5
-1      18   574.2     29   284.1
-2      29   401.2     29   211.5
-3      30   215.0     29   230.1
-4      29   198.2     30   236.2
min      8   198.2     28   211.5
max     30   727.7     30   284.1
mean    23   423.2     29   238.9



and AT9; the rpm of AT15; and the EngineSpeed of AFC3. For each artificial
rescaling, both the Simulink model and the specification have been changed.⁵
We run ForeSee and Breach on these rescaled benchmarks. Table 3 reports the
experimental results for each k. The table also reports the minimum, maximum,
and mean results for FR and execution time. We observe that the performance of
Breach, in terms of FR, is very sensitive to the scale problem. Indeed, for all the
specifications, FR decreases as k moves in one direction (increasing for some
specifications, decreasing for others); notable examples are AT3 and AT4, in which
Breach can (almost) always falsify with the minimum k, but never falsifies with
the two largest values of k. This demonstrates
the effects of the scale problem on falsification approaches that only rely on
quantitative robust semantics where the robustness values of different signals are
compared. By looking at the results of ForeSee, instead, we observe that it is
much more robust and its FR performance is largely independent of the applied
rescaling. This clearly shows that our falsification approach guided by QB-Robustness
is successful in avoiding the scale problem.


5 Note that k = 0 corresponds to the experimental result in Table 2, and we report it
again for reference.

Table 4. Falsification performance under different MCTS hyperparameters.

(a) Performance with varied c

        AT17           AT19           AT21
c       FR    time     FR    time     FR    time
0       23   177.8     30   224.6     30   463.4
0.02    26   196.7     30   278.5     28   501.3
0.2     30   208.8     30   154.3     29   425.5
0.5     30   297.0     29   227.3     30   509.0
1.0     30   311.7     30   240.2     24   497.0

(b) Performance with varied Bp

        AT17           AT19           AT21
Bp      FR    time     FR    time     FR    time
2       26   385.2     30   162.0     29   500.0
5       30   347.7     29   207.3     29   472.5
10      30   208.8     30   154.3     29   425.5
15      30   337.7     29   336.7     28   514.0
20      30   358.1     30   313.5     30   511.0


These results also allow us to show that the naive approach based on normalization
for solving the scale problem does not work, as also reported in [41].
Indeed, one may think that a solution for tackling the scale problem could be
to rescale the signals so that they have the same order of magnitude.
This is not a good approach. Let us consider the results in Table 3c for
AT4 ($\Box_{[0,30]}(\mathit{speed} < 135 \land \mathit{rpm} < 4780)$). In this case, speed is multiplied by
$10^k$. We may think that the best falsification result should occur when speed is
multiplied by $10^1$, because this would bring both signals to the order of
thousands. However, this rescaling gives the worst result (together with $k = 2$). The best
result is actually given by the rescalings making speed even smaller (i.e., $k = -2$
and $k = -1$). This means that the correct way of handling the scale problem
cannot be identified in advance; we need an approach such as ours that learns
the best strategy during falsification.
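The effect of rescaling on AT4 can be illustrated with a small sketch. It assumes the standard quantitative robust semantics (an atom f < c has robustness c - f, conjunction is a minimum, and the always operator is a minimum over the sampled time points); the trace is made up, and the function name `at4_robustness` is hypothetical.

```python
from typing import Sequence, Tuple


def at4_robustness(
    speed: Sequence[float], rpm: Sequence[float], k: int = 0
) -> Tuple[float, str]:
    """Robustness of always(speed < 135 and rpm < 4780) on a sampled trace,
    with speed (and its bound) rescaled by 10**k.

    Returns the robustness and the sub-formula that determines it.
    """
    scale = 10.0 ** k
    worst = (float("inf"), "")
    for s, r in zip(speed, rpm):
        rob_speed = scale * (135.0 - s)  # rescaled atom: 10^k*speed < 10^k*135
        rob_rpm = 4780.0 - r
        # Conjunction: minimum of two values that live on different scales.
        step = min((rob_speed, "speed"), (rob_rpm, "rpm"))
        # Always: minimum over all sampled time points.
        worst = min(worst, step)
    return worst


# Made-up trace in which both signals approach their bounds.
speed = [0.0, 60.0, 120.0]
rpm = [1000.0, 3000.0, 4700.0]
for k in (-2, 0, 2):
    print(k, at4_robustness(speed, rpm, k))
```

On this trace the robustness is determined by the speed atom for k = 0 and by the rpm atom for k = 2, so an optimizer minimizing this value receives feedback about only one of the two signals, and which one depends on the chosen scale.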


RQ3. How do the hyperparameters of MCTS influence the performance of the 
proposed approach? 

Our proposed approach is an instantiation of the Monte Carlo Tree Search
(MCTS) method [8,28] that can be configured with some hyperparameters,
namely the scalar c used by UCB1 (Line 12 in Algorithm 2), and the playout
budget Bp (Line 21 in Algorithm 2), both used for balancing between exploration
and exploitation. Therefore, the performance of MCTS could be affected
by the values used for these hyperparameters. In this RQ, we try to assess
this. We selected three benchmark specifications (AT17, AT19, and AT21) and
varied one hyperparameter while keeping the other fixed. Namely, we experimented
with $c \in \{0, 0.02, 0.2, 0.5, 1\}$ and budget Bp = 10 (see Table 4a), and
with Bp $\in \{2, 5, 10, 15, 20\}$ and scalar c = 0.2 (see Table 4b). Looking at the
results of Table 4a for AT17 and AT21, there seems to be some influence of
the scalar c. In AT17, the worst result in terms of FR is obtained when c is 0,
meaning that MCTS only focuses on exploitation. AT17 is a specification that
suffers from the scale problem, as shown by the very bad performance of Breach
in Table 2; for such a specification, we need to perform some exploration to find
the best X: this explains the low performance of MCTS with c = 0. On the other
hand, the worst FR performance of AT21 is given by the highest value c = 1,
which requires MCTS to spend a lot of time in exploration. Since AT21 is not an
extremely difficult specification (indeed Breach has an FR of 10 in Table 2), such a
very conservative approach does not pay off, while greedier approaches (i.e.,
with lower c) have better performance.

Looking at the results of Table 4b related to Bp, it seems that there is not
much influence. The only difference is in AT17 with Bp = 2, where
the FR is slightly lower than in the other cases. This means that, provided that a
sufficiently large value for Bp is given, ForeSee is not too sensitive to it.
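The role of the scalar c can be seen in a minimal sketch of UCB1-style child selection. This uses the textbook UCB1 form with a tunable scalar and assumes that higher rewards are better; the exact formula and reward definition in Algorithm 2 may differ, and the name `ucb1_select` is hypothetical.

```python
import math
from typing import Sequence


def ucb1_select(rewards: Sequence[float], visits: Sequence[int], c: float) -> int:
    """Return the index of the child maximizing
    reward + c * sqrt(2 * ln(total visits) / child visits)."""
    total = sum(visits)
    best_index, best_score = 0, -math.inf
    for i, (reward, n) in enumerate(zip(rewards, visits)):
        if n == 0:
            return i  # always try an unvisited child first
        score = reward + c * math.sqrt(2.0 * math.log(total) / n)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Child 0 looks better and is well explored; child 1 was rarely visited.
rewards, visits = [0.9, 0.5], [50, 2]
for c in (0.0, 0.2, 1.0):
    print(c, ucb1_select(rewards, visits, c))
```

With c = 0 the selection is purely greedy, whereas a large c favors rarely visited children; in this made-up example only c = 1.0 switches to the less explored child.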


6 Related Work 


Quality assurance of CPS has been actively studied, due to its great significance.
Different approaches, including but not limited to model checking, theorem
proving, rigorous numerics, and nonstandard analysis [9,16,20,22,23,31,33],
have been proposed to solve the problem. However, due to the scalability issue
and the existence of black-box components, those approaches are not widely applied
in real-world systems.

The optimization-based falsification approach inherits the search-based testing
methodology, and is much more scalable than pure verification-based
approaches. The key issue of search-based testing is the exploration-exploitation
trade-off. This issue has been discussed for the verification of quantitative
properties (e.g., [34]). In the falsification community, there has also been a lot
of work focusing on it, tackling the problem from different
perspectives. Metaheuristics are high-level strategies that utilize
heuristics to improve the search efficiency. Several metaheuristic strategies have
been applied in falsification, such as Simulated Annealing [1], tabu search [10],
and so on. Coverage-guided falsification [2,10,15,29] aims to guide the search
using some coverage metrics, so that the search space is sufficiently explored.
Recently, machine learning techniques have also been applied to falsification to
enhance the search ability. For instance, Bayesian optimization [3,11,36] utilizes
an acquisition function to balance exploration and exploitation; reinforcement
learning [27,37] naturally emphasizes exploration.

The scale problem is a recognized issue [12,21,40] that is known to severely
affect the performance of falsification. In [40], we proposed a multi-armed bandit
approach to solve the problem in a specific setting, that is, safety properties with
Boolean connectives: $\Box_I(\varphi_1 \land \varphi_2)$ and $\Box_I(\varphi_1 \lor \varphi_2)$. The approach is not
applicable to formulas having more nested sub-formulas, or even connectives having
more operands; therefore many of the benchmarks we used in Subsect. 5.2 fall out
of the scope of [40]. The techniques introduced in [12,21] rely on explicit declaration
of input vacuity and output robustness. Compared to their approaches, our
method does not need that; instead, we learn the significance of each signal through
tree exploration and reward computation.

MCTS, as an effective search framework, has been applied in testing hybrid
systems. In [30], the authors applied an adaptation of MCTS in testing, namely,
adaptive stress testing, to detect potentially dangerous cases of airborne collision.
A recent study of MCTS on hybrid system falsification is [39]. There,
the authors discretized the search space to construct the search tree, and then
applied MCTS to explore different sub-spaces. Compared to their approach, our
work aims to tackle the scale problem and so we exploit the structure of specification
formulas to construct the tree search framework.


7 Conclusion and Future Work 


Optimization-based falsification is a widely used approach for quality assurance
of CPS, which tries to find an input violating a Signal Temporal Logic (STL)
specification. It does this by exploiting the quantitative robust semantics of the
specification, trying to minimize its robustness. The performance of falsification
is affected by the scale problem when the robustness values of sub-formulas
predicating over signals having different scales are compared. In
this paper, we propose QB-Robustness, a new STL semantics that does not
suffer from the scale problem, because it avoids such comparison. The computation
of QB-Robustness requires specifying a sub-formula sequence telling for
which sub-formulas the quantitative robustness must be computed. We then
propose a Monte Carlo Tree Search (MCTS)-based falsification approach that 
synthesizes a sub-formula sequence for QB-Robustness, and uses this for guiding 
numerical optimization. Experimental results show that the proposed approach 
achieves better falsification results than a state-of-the-art falsification tool that 
uses standard quantitative robust semantics. 

In the analysis of RQ1, we observed that, when the specifications have a particular
structure, our approach has no advantage and, actually, it could decrease
the performance by trying to find a best sub-formula sequence that does not
exist for the current initial sampling. As future work, we plan to devise some
heuristics that could handle these cases: for example, we could perform a better
initial sampling (see Subsect. 2.1) that could provide better initial guidance.

Abstract. Given the versatility of timed automata, a huge body of work
has evolved that considers extensions of timed automata. One extension 
that has received a lot of interest is timed automata with a, possibly 
unbounded, stack, also called pushdown timed automata (PDTA). While 
different algorithms have been given for reachability in different variants 
of this model, most of these results are purely theoretical and do not give 
rise to efficient implementations. One main reason for this is that none of 
these algorithms (and the implementations that exist) use the so-called 
zone-based abstraction, but rely either on the region-abstraction or other 
approaches, which are significantly harder to implement. 

In this paper, we show that a naive extension, using simulations, of
the zone-based reachability algorithm for the control state reachability
problem of timed automata is not sound in the presence of a stack. To
understand this better we give an inductive rule-based view of the zone
reachability algorithm for timed automata. This alternate view allows 
us to analyze and adapt the rules to also work for pushdown timed 
automata. We obtain the first zone-based algorithm for PDTA which 
is terminating, sound and complete. We implement our algorithm in 
the tool TChecker and perform experiments to show its efficacy, thus 
leading the way for more practical approaches to the verification of timed 
pushdown systems. 


Keywords: Timed automata · Zone-based abstractions · Pushdown
automata · Simulations · Reachability


1 Introduction 


Timed automata [7] are a popular formalism for capturing real-time systems, and
are of use, for instance, in model checking of cyber-physical systems. They extend
finite automata with real variables called clocks whose values increase over time;
transitions are guarded by constraints over these variables. The main problem
of interest is the reachability problem, which asks whether a given state can be
reached while satisfying the constraints imposed by the guards. This problem is
known to be PSPACE-complete (already shown in [7]). The PSPACE algorithm
uses the so-called region-automaton construction, which essentially abstracts
the timed automaton into an exponentially larger finite automaton of regions
(collections of clock valuations), which is sound and complete for reachability.

Despite this complexity-theoretic hardness, the model of timed automata has
proved to be extremely influential and versatile, resulting in an enormous body
of work on its theory, variants and extensions over the past 25 years. Almost since
its inception, researchers also began to develop tools that extend theoretical
algorithms to solve practical problems. Such tools range from the classical
and richly featured tool UPPAAL [9,23] to the more recent open-source tool
TChecker [19], which have been used on industry-strength benchmarks and perform
rather well on many of them. These tools use a different algorithm for
reachability, where reachable sets of valuations are represented as zones and
explored in a graph. While a naive exploration of zones does not terminate, the
algorithms used identify different strategies [8,18,21], e.g., subsumption or
simulations, and extrapolations, for pruning the zone-based exploration graphs, while
preserving soundness and completeness of reachability. While this does not change
the worst-case complexity, in practice, the zone exploration results in much better
performance as it allows on-the-fly computation of reachable zones.
One could even argue that the wider adoption of the timed automata paradigm in
the verification community has been a result of scalable implementations and
tools built on this zone-based approach.

In light of this, zone-based algorithms are often sought to improve the practical
performance of extensions of timed automata as well. For instance, for timed
automata with diagonal constraints, classical zone-based approaches were shown
to be unsound [11,12], but recently, an approach has been developed which
adapts the existing construction and obtains fast zone-based algorithms [17].
In the present paper, we are interested in adding a different feature to timed
automata, namely an unbounded LIFO stack. This results in a powerful model of
pushdown timed automata (PDTA for short), in which the source of “infinity”
is both from real-time and the unbounded stack. Unsurprisingly, this model and
its variants have been widely studied over the last 20 years with several old and
recent results on decidability of reachability, related problems and their complexity,
including [1-5,10,13-16]. A wide variety of techniques have been employed to
solve these problems, from region-based abstractions, to using atoms and systems
of constraints, to encoding into different logics etc. However, except for [4,5], to
the best of our knowledge, none of the others carry an implementation. In [5],
the implementation uses a tree-automaton implicitly based on regions and the
focus in [4] is towards multi-pushdown systems. A common factor of all these
works is that none of them consider zone-based abstractions.

In this paper, we ask whether zone-based abstractions can be used to efficiently
decide reachability questions in PDTA. We focus on the problem of well-nested
control-state reachability of PDTA, i.e., given a PDTA, an initial and
a target state, does there exist a run of the PDTA that starts at the initial 
state with empty stack and reaches the target state with an empty stack (in 
between, i.e., during the run, the stack can indeed be non-empty). As with 
timed automata, our goal here is towards its applicability to build powerful 
tools which could lead to wider adoption of the PDTA model and showcase its 
utility to model-checking timed recursive systems. As the first step, we examine 
the difficulties involved in mixing zones with stacks and point out that a naive 
adaptation of the zone-based algorithm would not be sound. Then we propose 
a new algorithm that modifies the zone-based algorithm to work for pushdown 
timed automata. This is done in three steps. 


— First we view the zone-graph exploration at the heart of the zone-based reach- 
ability algorithm for timed automata as a least fixed point computation of 
two inductive rules. When applied till saturation, they compute a sound and 
complete finite abstraction of the set of all reachable zones. 

— Next, this view allows us to generalize the approach in the presence of a stack 
by adding new inductive rules that correspond to push and pop transitions, 
and hence are specific to the stack operation. There are two main technical 
difficulties in this. First, we need to ensure termination of the fixed point 
computation, using a strong enough pruning condition of the (a priori infinite) 
zone graph to ensure finiteness, while being sound and not adding spurious 
runs. Second, we want to aggressively prune the graph as much as possible to 
obtain an efficient zone-exploration algorithm. We show how we can minimally 
change the condition of pruning in the zone exploration graph to achieve this 
delicate balance. Indeed, in doing so we use a judicious combination of the 
subsumption (or simulation) relation and an equivalence relation for obtaining 
a fixed point computation for PDTA that is terminating, while being sound 
and complete. 

— Finally, we build new data structures that allow us to write an efficient algo- 
rithm that implements this fixed point computation. While getting a cor- 
rect algorithm is relatively simple, to obtain an efficient one, we must again 
encounter and overcome several technical difficulties. 


We implement our approach to build the first zone-based tool that efficiently 
solves well-nested control state reachability for PDTA. Our tool is built on top 
of existing infrastructure of TChecker [19], an open source tool and benefits 
from many existing optimizations. We perform experiments to show the practi- 
cal performance of multiple variants of our algorithm and show how our most 
optimized version is vastly better in performance than other variants and of 
course the earlier region-based approach on a suite of example benchmarks. 
We note that our PDTA model differs slightly from the model considered
in [1,3], as there is no age on stack and time spent on stack cannot be compared
with clocks. Hence our model is closer to [10,16]. However, in [13], it was
shown that these two models are equivalent, more specifically, the stack can be
untimed without loss of expressivity (albeit with an exponential blowup). Thus
our approach can be applied to the other model as well by just untiming the
stack. There are other more powerful extensions [14,15] studied especially in the
context of binary reachability, where only theoretical results are known. We also
remark that the idea of combining the subsumption relation between zones with
an equivalence relation also occurs while tackling liveness, or Büchi acceptance,
in timed automata. This has been studied in depth [20,22,24], where the naive
zone-based algorithm does not work, forcing the authors to strengthen the simu- 
lation relation in different ways. Though these problems are quite different, there 
are surprising similarities in the issues faced, as explained in Sect. 3. 

The structure of the paper is as follows: we start with preliminaries and move 
on to the difficulty in using zones and simulation relations in solving reachability 
in PDTA. Then, we introduce in Sect. 4 our inductive rules for timed automata 
and PDTA and show their correctness. In Sect. 5, we present our algorithm and
helpful data-structural advancements. We show the experimental performance
in Sect. 6 and end with a brief conclusion. Proofs that are missing and more
experimental results can be found in the long version of the paper available 
at [6]. 


2 Preliminaries 


2.1 Timed Automata 


Timed automata extend finite-state automata with a set $X$ of (non-negative)
real-valued variables called clocks. We let $\Phi(X)$ denote the set of constraints
$\varphi$ that can be formed using the grammar $\varphi ::= x \sim c \mid x - y \sim c \mid \varphi \land \varphi$,
where $x, y \in X$, $c \in \mathbb{N}$, $\sim\, \in \{<, \leq, \geq, >\}$, and each $x \sim c$ is called an atomic
constraint. A clock valuation is a map $v : X \to \mathbb{R}_{\geq 0}$ and is said to satisfy $\varphi$,
denoted $v \models \varphi$, if $\varphi$ evaluates to true when each clock $x \in X$ is replaced
with $v(x)$. For $\delta \in \mathbb{R}_{\geq 0}$, we write $v + \delta$ to denote the valuation defined as
$(v + \delta)(x) = v(x) + \delta$ for all clocks $x$. For $R \subseteq X$, we write $[R]v$ to denote the
valuation obtained by resetting clocks in $R$, i.e., $([R]v)(x) = 0$ if $x \in R$, and
$([R]v)(x) = v(x)$ otherwise. Finally, $v_0$ is the valuation that sets all clocks to 0.

A timed automaton $A$ is a tuple $(Q, X, q_0, \Delta, F)$, where $Q$ is a finite set of
states, $X$ is a finite set of clocks, $q_0 \in Q$ is an initial state, $F \subseteq Q$ is the set of
final states and $\Delta \subseteq Q \times \Phi(X) \times 2^X \times Q$ is a set of transitions. A transition
$t \in \Delta$ is of the form $(q, g, R, q')$, where $q, q'$ are states, $g \in \Phi(X)$ is the guard
of the transition and $R \subseteq X$ is the set of clocks that are reset at the transition.
The semantics of a timed automaton $A$ is given as a transition system $TS(A)$
over configurations. A configuration is a pair $(q, v)$ where $q \in Q$ is a state and
$v$ is a valuation, with the initial configuration being $(q_0, v_0)$. The transitions are
of two types. First, for a configuration $(q, v)$ and $\delta \in \mathbb{R}_{\geq 0}$, $(q, v) \xrightarrow{\delta} (q, v + \delta)$ is
a delay transition. Second, for $t = (q, g, R, q') \in \Delta$, $(q, v) \xrightarrow{t} (q', v')$ is a discrete
transition if $v \models g$ and $v' = [R]v$. A run is an alternating sequence of delays
and discrete transitions starting from the initial configuration, and is said to be
accepting if the last state in the sequence is a final state.
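The two kinds of transitions can be mirrored directly in code. The sketch below is illustrative only: valuations are dictionaries from clock names to floats, and guards are represented as Python predicates rather than as constraint syntax, which is a simplification made here. The names `delay`, `reset`, `Transition` and `fire` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional, Tuple

Valuation = Dict[str, float]
Configuration = Tuple[str, Valuation]


def delay(v: Valuation, d: float) -> Valuation:
    """The valuation v + d: every clock advances by the same amount d >= 0."""
    if d < 0:
        raise ValueError("delays must be non-negative")
    return {x: value + d for x, value in v.items()}


def reset(v: Valuation, clocks: FrozenSet[str]) -> Valuation:
    """The valuation [R]v: clocks in R are set to 0, the others are kept."""
    return {x: (0.0 if x in clocks else value) for x, value in v.items()}


@dataclass(frozen=True)
class Transition:
    source: str
    guard: Callable[[Valuation], bool]  # stands in for a constraint g
    resets: FrozenSet[str]
    target: str


def fire(config: Configuration, t: Transition) -> Optional[Configuration]:
    """Discrete transition: enabled only if v satisfies the guard."""
    state, v = config
    if state != t.source or not t.guard(v):
        return None
    return t.target, reset(v, t.resets)


# A run fragment: wait 1.5 time units in q0, then take a transition
# guarded by x >= 1 that resets x.
v0 = {"x": 0.0, "y": 0.0}
t = Transition("q0", lambda v: v["x"] >= 1, frozenset({"x"}), "q1")
print(fire(("q0", delay(v0, 1.5)), t))
```

Taking the transition directly from the initial configuration returns `None`, since the guard is not satisfied at time 0.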
2.2 Reachability, Zones and Simulations


The problem of control-state reachability asks whether a given timed automaton
has an accepting run. This problem is known to be PSPACE-complete [7],
originally shown via the so-called region abstraction. Note that, since $TS(A)$ is
infinite, some abstraction is needed to get an algorithm. In practice however,
the abstraction used to solve reachability, e.g., in tools such as UPPAAL [23]
or TChecker [19], is the zone abstraction. A zone $Z$ is a set of valuations
defined by a conjunction of atomic clock constraints. Given a guard
$g$ and reset $R$, we define the following operations on zones: time elapse
$\overrightarrow{Z} = \{v + \delta \mid v \in Z, \delta \in \mathbb{R}_{\geq 0}\}$, guard intersection $g \cap Z = \{v \in Z \mid v \models g\}$ and reset
$[R]Z = \{[R]v \mid v \in Z\}$. The resulting sets are also zones. With this, we can define
the zone graph $ZG(A)$ as a transition system obtained as follows: the nodes are
(state, zone) pairs and $(q, Z) \xrightarrow{t} (q', Z')$ if $t = (q, g, R, q')$ is a transition of $A$
and $Z' = \overrightarrow{[R](g \cap Z)}$. The initial node is $(q_0, Z_0 = \{v_0\})$ and a path in the zone
graph is said to be accepting if it ends at an accepting state. The zone graph is
known to be sound and complete for reachability, but as the graph may still be
infinite, this does not give an algorithm for solving reachability yet.

To obtain an algorithm, one resorts to different techniques such as extrapolation
or simulation. Here we focus on simulation relations which will lead to finite
abstractions. Given a timed automaton $A$, a binary relation $\preceq$ on configurations
is called a simulation if whenever $(q, v) \preceq (q', v')$, we have $q = q'$ and

— for each delay $\delta \in \mathbb{R}_{\geq 0}$, $(q, v + \delta) \preceq (q, v' + \delta)$ and
— for each $t = (q, g, R, q_1) \in \Delta$, if $v \models g$ then $v' \models g$ and $(q_1, [R]v) \preceq (q_1, [R]v')$.

We often simply write $v \preceq_q v'$ instead of $(q, v) \preceq (q, v')$. We can now lift
this to sets $Z, Z'$ of valuations as $Z \preceq_q Z'$ if for all $v \in Z$ there exists $v' \in Z'$
such that $v \preceq_q v'$. We say that node $(q, Z)$ is subsumed by node $(q, Z')$ when
$Z \preceq_q Z'$. As a consequence we obtain the following lemma.


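Lemma 1. If $(q, Z) \xrightarrow{t} (q_1, Z_1)$ in ZG(A) and $Z \preceq_q Z'$,
then $(q, Z') \xrightarrow{t} (q_1, Z'_1)$ and $Z_1 \preceq_{q_1} Z'_1$.


Proof. Indeed, let $v_1 \in Z_1 = \overrightarrow{[R](g \cap Z)}$. We find
$v \in Z$ and $\delta \geq 0$ such that $v \models g$ and
$v_1 = [R]v + \delta$. Since $Z \preceq_q Z'$, we find $v' \in Z'$ with
$v \preceq_q v'$. We deduce that $v' \models g$ and
$[R]v \preceq_{q_1} [R]v'$, which implies $v_1 \preceq_{q_1} v'_1$ with
$v'_1 = [R]v' + \delta \in Z'_1 = \overrightarrow{[R](g \cap Z')}$.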


A simulation $\preceq$ is said to be finite if for every infinite sequence of
nodes $(q_1, Z_1), (q_2, Z_2), \ldots$ there exist two nodes $(q_i, Z_i)$ and
$(q_j, Z_j)$ with $i < j$ such that $q_i = q_j$ and $Z_j \preceq_{q_j} Z_i$.
The importance of finiteness is that it allows us to stop the exploration of
zones along a branch of the zone graph: when a node $(q_j, Z_j)$ is reached
which is subsumed by an earlier node $(q_i, Z_i)$, we may cut the exploration,
since all control states reachable from $(q_j, Z_j)$ are also reachable from
$(q_i, Z_i)$. For a timed automaton A, we call this pruned graph
$ZG^{\preceq}(A)$. Thus, if the simulation relation $\preceq$ is finite, then
$ZG^{\preceq}(A)$ is finite, sound and complete for control state reachability.
We formalize this algorithm in Sect. 4, using inductive rules.

Various finite simulations have been shown to exist in the literature, including 
the famous LU-abstractions [8], and more recent G-abstractions based on sets 
of guards [17]. Hence this theory indeed has resulted in better implementations 
and is used in standard tools in this domain. 

We will see that using simulation in the context of pushdown timed automata
is not always sound; in some cases we need a stronger condition to stop the
exploration. Towards this, we consider the equivalence relation on nodes induced
by the simulation relation: $Z \sim_q Z'$ if $Z \preceq_q Z'$ and
$Z' \preceq_q Z$. We say that the simulation $\preceq$ is strongly finite if
the induced equivalence relation $\sim$ has finite index. Notice that strongly
finite implies finite but the converse does not necessarily hold. Fortunately,
the usual simulations for timed automata, in particular the LU-simulation and
the G-simulation, are strongly finite.


2.3 Pushdown Timed Automata (PDTA) 


A Pushdown Timed Automaton A is a tuple $(Q, X, q_0, \Gamma, \Delta, F)$, where
Q is a finite set of states, X is a finite set of clocks, $q_0 \in Q$ is an
initial state, $\Gamma$ is the stack alphabet, $F \subseteq Q$ is the set of
final states and $\Delta$ is a set of transitions. A transition $t \in \Delta$
is of the form $(q, g, op, R, q')$, where $q, q'$ are states, $g \in B(X)$ is
the guard of the transition, $R \subseteq X$ is the set of clocks that are
reset at the transition, and op is one of three stack operations: nop, or
$push_a$ or $pop_a$ for some $a \in \Gamma$.

The semantics of a PDTA A is given as a transition system TS(A) over
configurations. A configuration here is a tuple $(q, v, \chi)$ where $q \in Q$
is a state, v is a valuation, and $\chi \in \Gamma^*$ is the stack content, with
the initial configuration being $(q_0, v_0, \varepsilon)$. The transitions are
of two types. First, for a configuration $(q, v, \chi)$ and
$\delta \in \mathbb{R}_{\geq 0}$,
$(q, v, \chi) \xrightarrow{\delta} (q, v + \delta, \chi)$ is a delay
transition. Second, for $t = (q, g, op, R, q') \in \Delta$,
$(q, v, \chi) \xrightarrow{t} (q', v', \chi')$ is a discrete transition if
$v \models g$, $v' = [R](v)$ and


— if op = nop, then $\chi' = \chi$,
— if op = $push_a$, then $\chi' = \chi \cdot a$,
— if op = $pop_a$, then $\chi = \chi' \cdot a$.


A run is an alternating sequence of delays and discrete actions starting from the
initial configuration. It is accepting if the last state in the sequence is final.
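
The effect of the three stack operations on the stack content can be sketched as a small function. This is a minimal illustration, not part of the construction itself: the name `apply_stack_op` and the encoding of the stack as a tuple whose last element is the top are assumptions of the example.

```python
from typing import Optional, Tuple

Stack = Tuple[str, ...]  # stack content; the top of the stack is the last element


def apply_stack_op(op: str, symbol: Optional[str], stack: Stack) -> Optional[Stack]:
    """Return the stack after the operation, or None if it is not enabled."""
    if op == "nop":
        return stack                      # stack content is unchanged
    if op == "push":
        return stack + (symbol,)          # new content is old content followed by a
    if op == "pop":
        # pop_a is enabled only when a is on top of the stack
        if stack and stack[-1] == symbol:
            return stack[:-1]
        return None
    raise ValueError(f"unknown stack operation: {op}")
```

A well-nested run, in this encoding, is one that starts and ends with the empty tuple `()`.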

Our main focus is the well-nested control state reachability problem for
PDTA, which asks whether a configuration $(q, v, \varepsilon)$ with $q \in F$ is
reachable, where the stack is empty. Later, in Sect. 7, we remark how our
solution can be extended to solve general control state reachability, i.e.,
asking whether a configuration $(q, v, \chi)$ with $q \in F$ is reachable,
possibly with a nonempty stack $\chi$.


Fig. 1. A simple PDTA with 2 clocks {x, y}. Note that if we ignore the push/pop
actions we get a TA, say A.


Fig. 2. Zone graph with simulation edges for finiteness. Again ignoring push/pop
actions gives us a zone graph for the TA. $Z_0$ is the initial zone.


3 Zones in PDTA and the Problem with Simulations 


As mentioned earlier, zones are collections of clock valuations defined by
conjunctions of timing constraints, and exploring zones reached by a timed
automaton gives a sound and complete abstraction for state reachability. To make
sure that the exploration is finite we need to prune the graph, and one way to
do this is by simulation, i.e., not exploring paths from some nodes if they are
“subsumed” by earlier nodes visited in the graph. Consider Fig. 1, in which we
ignore the $push_a$ and $pop_a$ operations, or think of them as internal
actions. Then the usual zone graph construction with simulation would give the
graph depicted in Fig. 2. In this section, just for illustration, we instantiate
the simulation relation to be the well-known LU-simulation (we do not give the
definition here as it is not relevant to what comes later; instead we refer to
earlier work [8]). Using this, we obtain that the rightmost node is subsumed by
the previous one, hence the dotted simulation edge. Without this subsumption,
we would get an infinite graph with increasing sets of zones.

Now, our first question is whether this zone exploration with simulation can 
be lifted to PDTA. In this example, if we were to add back the push/pop edges, 
we get exactly the same Zone graph with annotations, and further, the final 
state is indeed reachable. Hence, for this particular example we do obtain a 
finite, sound and complete graph exploration. However, in general it turns out 
that the procedure is not sound. 

Consider the example in Fig. 3. In this example, again considering it as a TA 
(ignoring the push/pops), we would get the zone graph below, which would be 
finite, sound and complete for reachability in that TA. But if we consider it as a 
PDTA, now doing the same does not preserve soundness. In other words, in the 
PDTA, q3 is no longer reachable. However, in the zone graph we would conclude 
that it is reachable due to the simulation edge. If, to fix this, we remove the 
dotted simulation edge, then we will lose finiteness. 

Thus, it seems that we have a difficult situation where zones with the
simulation relation, needed for termination, do not preserve soundness. This
situation resembles the one studied in [20,22,24], where the authors study
liveness or Büchi acceptance conditions in timed automata. Again in that
situation, the naive algorithm with zone simulation does not work and the
authors are forced to strengthen the simulation relation in different ways.


Fig. 3. A PDTA and its zone graph with simulation. With the simulation (dotted)
edges, $q_3$ is reachable in the zone graph, but it is not reachable in the PDTA.

Surprisingly, it turns out that even in our very different problem setting of
reachability in PDTA, a similar solution works. That is, we replace simulation by
equivalence (defined in the previous section) as the pruning criterion. However,
there are two issues: (i) it is not easy to prove its correctness and (ii) it is
far from efficient, as shown in the experimental section. Our goal in using zones
in the first place was efficiency, and hence we would like to prune the zone
graph as much as possible, i.e., we would like to use simulation edges as much
as possible. In the next two sections, we describe our fix. We first show a
different view of the exploration algorithm as a fixed point, rule-based
approach. This allows us to then describe our fix in the same language, which is
much easier to understand conceptually. Also, as a corollary, we will be able to
show that using equivalence everywhere also gives a correct algorithm. After
proving the correctness of our rule-based algorithm, we then tackle the
challenges in implementing it.


4 Viewing Reachability Algorithms Using Rewrite Rules 


In this section, our goal is to compute a set S of nodes of the zone graph of a
PDTA, as a least fixed point of a small set of inductive rules, such that a
control state q occurs in S, i.e., $(q, Z) \in S$ for some Z, iff q is reachable
in the PDTA from its initial state. To understand the rules and their
correctness it is easier to first visualize this on plain timed automata without
any push-pop edges.


4.1 Rewrite Rules for Timed Automata


Given a TA $A = (Q, X, q_0, \Delta, F)$, the set S containing all reachable
nodes of the zone graph can be obtained as the least fixed point of the
following inductive rules, with a natural deduction style of presentation.


$$\frac{}{S := \{(q_0, Z_0)\}}\;\text{start}
\qquad\qquad
\frac{(q, Z) \in S \qquad q \xrightarrow{g,\,R} q' \qquad Z' = \overrightarrow{R(g \cap Z)} \neq \emptyset}{S := S \cup \{(q', Z')\}}\;\text{trans}$$

Let $S^*$ denote the set at the fixed point obtained by starting with the start
rule and repeatedly applying the trans rule. It is easy to see that this
computes the set of all reachable nodes of the zone graph: the start rule starts
with the initial node and each application of the trans rule takes a reachable
node, applies a transition of the automaton and includes the resulting node.
However, this set $S^*$ is a priori infinite since the number of zones is
infinite.

To make it finite we add a condition under which we will apply the transition
rule, based on a finite simulation relation (let us denote it $\preceq$) for A.


$$\frac{(q, Z) \in S \qquad q \xrightarrow{g,\,R} q' \qquad Z' = \overrightarrow{R(g \cap Z)} \neq \emptyset}{S := S \cup \{(q', Z')\},\ \text{unless } \exists (q', Z'') \in S,\ Z' \preceq_{q'} Z''}\;\text{Trans-}{\preceq}$$


Thus to obtain an algorithm, we would explore all nodes in the zone graph
using a search algorithm (say DFS/BFS) and we would add a node only if it is
not subsumed by an already visited node, according to the simulation relation.
We explained in Sect. 2.2 that doing this preserves soundness and completeness
and gives a finite exploration.
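
The exploration just described can be sketched as a worklist loop over abstract nodes. This is only a sketch under stated assumptions: `explore`, `successors` and `subsumed` are hypothetical names, the zone type is left abstract, and the toy instance at the end uses integers as stand-ins for zones rather than an actual simulation on clock valuations.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Tuple

Node = Tuple[Hashable, Hashable]  # (state, zone); the zone type is left abstract


def explore(
    initial: Node,
    successors: Callable[[Node], Iterable[Node]],
    subsumed: Callable[[Hashable, Hashable, Hashable], bool],
) -> List[Node]:
    """Apply start once, then the pruned trans rule until nothing new is added."""
    visited: List[Node] = [initial]   # the set S (start rule)
    todo = deque([initial])           # nodes whose successors are still pending
    while todo:
        node = todo.popleft()
        for q1, z1 in successors(node):
            # "unless" condition: skip the node if a stored node subsumes it
            if any(q2 == q1 and subsumed(q1, z1, z2) for q2, z2 in visited):
                continue
            visited.append((q1, z1))
            todo.append((q1, z1))
    return visited


# Toy instance: a self-loop that increments an integer "zone" forever.
def toy_successors(node: Node) -> Iterable[Node]:
    q, z = node
    return [(q, z + 1)]


def toy_subsumed(q: Hashable, z1: int, z2: int) -> bool:
    # Stand-in for a finite simulation: all values >= 3 are indistinguishable.
    return z1 == z2 or (z1 >= 3 and z2 >= 3)


print(explore(("q", 0), toy_successors, toy_subsumed))
# prints [('q', 0), ('q', 1), ('q', 2), ('q', 3)]
```

Without the subsumption check the toy loop would never terminate, which mirrors the role of the simulation relation in making the exploration finite.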


Lemma 2. Let $S^{\preceq}$ denote any set obtained from the start rule and by
repeatedly applying Trans-$\preceq$ till a fixed point is reached. Note that
depending on the order of applications we may have different sets. Then we have:


1. (finiteness) $S^{\preceq}$ is finite.
2. (soundness and completeness) For all $q \in Q$, a configuration $(q, v)$ is
reachable from $(q_0, v_0)$ in the TA A iff $(q, Z) \in S^{\preceq}$ for some
zone Z.


We do not give the proof here as (i) it is only a reformulation of known results 
and (ii) it will be subsumed by the much stronger theorem we prove next. 


4.2 Rewrite Rules for PDTA 


Let $A = (Q, X, q_0, \Gamma, \Delta, F)$ be a PDTA. We will need not just a set
but a tuple of sets. More precisely, we maintain a set of nodes $\mathfrak{S}$
called root nodes. For each root node $(q, Z) \in \mathfrak{S}$, we also
maintain a set of nodes, denoted $S_{(q,Z)}$. The intuition is that root nodes
are those that can be reached after pushing a symbol to the stack, whereas
$S_{(q,Z)}$ will be the set of nodes that can be reached from $(q, Z)$ with a
well-nested run, i.e., starting with an empty stack and ending in an empty
stack. This is to avoid storing the stack contents in our algorithm, which would
be another source of infinity. Again, we use simulations to make the computation
finite. So we fix a strongly finite simulation relation $\preceq$ for A.

Our inductive rules for the control state reachability of pushdown timed
automata are given in Table 1. Note that the internal rule is just the same as
for timed automata above. The start rule not only starts the set of nodes
computation but also the set of roots computation as described above.


Table 1. Inductive rules for control state reachability of PDTA

$$\frac{}{\mathfrak{S} := \{(q_0, Z_0)\},\quad S_{(q_0,Z_0)} := \{(q_0, Z_0)\}}\;\text{Start}$$

$$\frac{(q, Z) \in \mathfrak{S} \quad (q', Z') \in S_{(q,Z)} \quad q' \xrightarrow{g,\,nop,\,R} q'' \quad Z'' = \overrightarrow{R(g \cap Z')} \neq \emptyset}{S_{(q,Z)} := S_{(q,Z)} \cup \{(q'', Z'')\},\ \text{unless } \exists (q'', Z''') \in S_{(q,Z)},\ Z'' \preceq_{q''} Z'''}\;\text{Internal}$$

$$\frac{(q, Z) \in \mathfrak{S} \quad (q', Z') \in S_{(q,Z)} \quad q' \xrightarrow{g,\,push_a,\,R} q'' \quad Z'' = \overrightarrow{R(g \cap Z')} \neq \emptyset}{\mathfrak{S} := \mathfrak{S} \cup \{(q'', Z'')\},\ S_{(q'',Z'')} := \{(q'', Z'')\},\ \text{unless } \exists (q'', Z''') \in \mathfrak{S},\ Z'' \sim_{q''} Z'''}\;\text{Push}$$

$$\frac{\begin{array}{c}
(q, Z) \in \mathfrak{S} \quad (q', Z') \in S_{(q,Z)} \quad q' \xrightarrow{g,\,push_a,\,R} q'' \quad Z'' = \overrightarrow{R(g \cap Z')} \sim_{q''} Z_1 \\
(q'', Z_1) \in \mathfrak{S} \quad (q'_1, Z'_1) \in S_{(q'',Z_1)} \quad q'_1 \xrightarrow{g_1,\,pop_a,\,R_1} q_2 \quad Z_2 = \overrightarrow{R_1(g_1 \cap Z'_1)} \neq \emptyset
\end{array}}{S_{(q,Z)} := S_{(q,Z)} \cup \{(q_2, Z_2)\},\ \text{unless } \exists (q_2, Z'_2) \in S_{(q,Z)},\ Z_2 \preceq_{q_2} Z'_2}\;\text{Pop}$$


So the only interesting rules are the Push and Pop rules. The push rule says
that when a push is encountered, then we must start exploring from a new root
(i.e., context). The more complicated rule is the Pop rule. Here the intuition
is that if we see a push at a node and, from a root equivalent to the root
created from it (i.e., its context), we see a matching pop reaching a new node,
then this push-pop context is complete, and we can add this new node to the set
of reachable nodes. This is precisely the point where we need equivalence rather
than simulation and this will be made clear in the proof of the theorem below.


Theorem 1. Let $\mathfrak{S}^*$ and $(S_{(q,Z)})_{(q,Z) \in \mathfrak{S}^*}$
denote any tuple of sets obtained from the start rule and by repeatedly applying
the rules in Table 1 till a fixed point is reached$^1$. Note that we always have
$(q_0, Z_0) \in \mathfrak{S}^*$. The following statements hold:


1. (finiteness) $\mathfrak{S}^*$ is finite and for each
$(q, Z) \in \mathfrak{S}^*$, $S_{(q,Z)}$ is finite.

2. (completeness) For each $(q, Z) \in \mathfrak{S}^*$, if there exists a run
$(q, v, \varepsilon) \xrightarrow{*} (q', v', \varepsilon)$ of A with
$\{v\} \preceq_q Z$, then there exists $(q', Z') \in S_{(q,Z)}$ such that
$\{v'\} \preceq_{q'} Z'$.

3. (soundness) For each $(q, Z) \in \mathfrak{S}^*$, $(q', Z') \in S_{(q,Z)}$
and $v' \in Z'$, there exists a run in the PDTA from $(q, v, \varepsilon)$ to
$(q', v'', \varepsilon)$ with $v \in Z$ and $v' \preceq_{q'} v''$.


Proof. 1. Note that only the Push rule creates new root nodes, and the “unless”
condition of this rule states that a new root node is added only if there isn’t
already an equivalent node in $\mathfrak{S}^*$. Since the simulation relation is
strongly finite, the set of roots $\mathfrak{S}^*$ must be finite. Also, before
adding a node to some $S_{(q,Z)}$ with the internal rule or the pop rule, we
check that the node is not subsumed by an existing one. Since the simulation
relation is finite, this ensures that each set $S_{(q,Z)}$ is finite.


1 As before, there could be several such sets depending on the order in which the rules
are applied.


Fig. 4. Construction for the completeness-push-pop last sub-case.

2. Let $(q, Z) \in \mathfrak{S}^*$ and assume that $(q', v', \varepsilon)$ is
reachable from some $(q, v, \varepsilon)$ with $v \preceq_q Z$, i.e., there
exists a run
$(q, v, \varepsilon) = (q_1, v_1, \chi_1) \to \cdots \to (q_n, v_n, \chi_n) = (q', v', \varepsilon)$.
We will then show that $v_n \preceq_{q_n} Z'$ for some
$(q_n, Z') \in S_{(q,Z)}$. The proof is by induction on n. Base case: For
$n = 1$ we have $q' = q$ and $v' = v$. The result is obtained by taking
$Z' = Z$. Notice that $(q, Z) \in S_{(q,Z)}$ follows immediately from the start
rule if $q = q_0$, $Z = Z_0$, or from the Push rule, which creates the root,
otherwise.

Let us then assume that the statement holds for runs of length at most $n - 1$.
Consider any run of the form
$(q, v, \varepsilon) = (q_1, v_1, \chi_1) \to \cdots \to (q_n, v_n, \chi_n = \varepsilon)$
with $v \preceq_q Z$. Notice that its last transition
$(q_{n-1}, v_{n-1}, \chi_{n-1}) \to (q_n, v_n, \chi_n)$ cannot be a push
transition (in the PDTA) since $\chi_n = \varepsilon$. Hence, we have three
subcases, depending on the last transition.


— Time elapse. $\chi_{n-1} = \chi_n = \varepsilon$, $q_{n-1} = q_n = q'$,
$v_n = v_{n-1} + \delta$ for some $\delta \in \mathbb{R}_{\geq 0}$. Applying
the induction hypothesis, we have $v_{n-1} \preceq_{q'} Z'$ for some
$(q', Z') \in S_{(q,Z)}$. Since zones are closed under time elapse, we get
$Z' = \overrightarrow{Z'}$ and, by definition of the simulation relation,
$v_n = v_{n-1} + \delta \preceq_{q'} \overrightarrow{Z'} = Z'$. This completes
the case.

— Discrete internal transition. In this case
$\chi_{n-1} = \chi_n = \varepsilon$,
$t = q_{n-1} \xrightarrow{g,\,nop,\,R} q_n$, $v_{n-1} \models g$ and
$v_n = [R]v_{n-1}$. Then applying the induction hypothesis, there exists
$(q_{n-1}, Z') \in S_{(q,Z)}$ such that $v_{n-1} \preceq_{q_{n-1}} Z'$. Now let
$Z'' = \overrightarrow{R(g \cap Z')}$. From the definition of the simulation
relation we get $v_n \preceq_{q_n} Z''$. Then, applying the Internal rule, there
exists $(q_n, Z''') \in S_{(q,Z)}$ such that $Z'' \preceq_{q_n} Z'''$, with
possibly $Z''' = Z''$. Hence, $v_n \preceq_{q_n} Z'' \preceq_{q_n} Z'''$, which
completes the case.

— Pop transition. Then there exists $1 \leq i < n - 1$ such that the run has
the form:
$(q_1, v_1, \varepsilon) \to \cdots \to (q_i, v_i, \chi_i = \varepsilon) \xrightarrow{push_a} (q_{i+1}, v_{i+1}, \chi_{i+1} = a) \to \cdots \to (q_{n-1}, v_{n-1}, \chi_{n-1} = a) \xrightarrow{pop_a} (q_n, v_n, \chi_n = \varepsilon)$,
where the push and pop are matching transitions, i.e., $|\chi_j| \geq 1$ for
all $i < j \leq n - 1$ (see Fig. 4). Then by induction hypothesis at i, we have

$$v_i \preceq_{q_i} Z_i \text{ for some } (q_i, Z_i) \in S_{(q,Z)}. \qquad (1)$$

From the push transition we have

$$t = q_i \xrightarrow{g,\,push_a,\,R} q_{i+1} \in \Delta \text{ with } v_i \models g \text{ and } v_{i+1} = [R]v_i. \qquad (2)$$

Let $Z_{i+1} = \overrightarrow{R(g \cap Z_i)}$. By definition of the simulation
relation, we deduce from $v_i \preceq_{q_i} Z_i$ that
$v_{i+1} \preceq_{q_{i+1}} Z_{i+1}$. We can apply the Push rule to obtain

$$(q_{i+1}, Z'_{i+1}) \in \mathfrak{S}^* \text{ for some } Z'_{i+1} \sim_{q_{i+1}} Z_{i+1} \qquad (3)$$

possibly with $Z'_{i+1} = Z_{i+1}$ as a special case.

Further, the segment of run
$(q_{i+1}, v_{i+1}, a) \to \cdots \to (q_{n-1}, v_{n-1}, a)$ in the PDTA never
pops the symbol a (by choice, since otherwise the push and pop would not be
matching). Hence we will also have the same sequence of transitions forming a
run
$(q_{i+1}, v_{i+1}, \varepsilon) \to \cdots \to (q_{n-1}, v_{n-1}, \varepsilon)$.
Using $v_{i+1} \preceq_{q_{i+1}} Z_{i+1} \sim_{q_{i+1}} Z'_{i+1}$ we deduce that
$v_{i+1} \preceq_{q_{i+1}} Z'_{i+1}$. By induction hypothesis,

$$v_{n-1} \preceq_{q_{n-1}} Z_{n-1} \text{ for some } (q_{n-1}, Z_{n-1}) \in S_{(q_{i+1}, Z'_{i+1})}. \qquad (4)$$

Finally, we have the pop transition

$$t_1 = q_{n-1} \xrightarrow{g_1,\,pop_a,\,R_1} q_n \in \Delta \text{ with } v_{n-1} \models g_1 \text{ and } v_n = [R_1]v_{n-1}. \qquad (5)$$

We let $Z_n = \overrightarrow{R_1(g_1 \cap Z_{n-1})}$. From
$v_{n-1} \preceq_{q_{n-1}} Z_{n-1}$ and the definition of the simulation
relation we obtain $v_n \preceq_{q_n} Z_n$. Then, combining all the above
equations (1-5), and applying the Pop rule, we obtain some
$(q_n, Z'_n) \in S_{(q,Z)}$ with $Z_n \preceq_{q_n} Z'_n$ (possibly
$Z'_n = Z_n$). Finally we get $v_n \preceq_{q_n} Z_n \preceq_{q_n} Z'_n$. This
completes the proof.


3. We will show that the following property is invariant by rule applications:

$$\forall (q, Z) \in \mathfrak{S},\ \forall (q', Z') \in S_{(q,Z)},\ \forall v' \in Z', \text{ there is a run } (q, v, \varepsilon) \xrightarrow{*} (q', v'', \varepsilon) \text{ with } v \in Z \text{ and } v' \preceq_{q'} v'' \qquad (\text{Inv})$$

The invariant holds initially, i.e., after application of the start rule.
Indeed, in this case we have $\mathfrak{S} = \{(q_0, Z_0)\}$ and
$S_{(q_0,Z_0)} = \{(q_0, Z_0)\}$. Hence $(q', Z') = (q, Z) = (q_0, Z_0)$ and for
all $v \in Z_0$ we can choose the empty run from $(q_0, v, \varepsilon)$ to
itself.

We show below that (Inv) is preserved by application of an internal/push/pop
rule. Therefore, the invariant still holds when reaching the fixed point, which
proves the soundness. Let us write $\mathfrak{S}^-$ and $S^-_{(q,Z)}$ for the
sets before the application of the rule and $\mathfrak{S}$ and $S_{(q,Z)}$ for
the sets after the application of the rule.


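Internal Rule. Let $(q, Z) \in \mathfrak{S} = \mathfrak{S}^-$,
$(q', Z') \in S_{(q,Z)}$ and $v' \in Z'$. If $(q', Z') \in S^-_{(q,Z)}$ then we
get the result since (Inv) holds before applying the internal rule. Otherwise,
there is some $(q_1, Z_1) \in S^-_{(q,Z)}$ and a transition
$t = q_1 \xrightarrow{g,\,nop,\,R} q'$ with
$Z' = \overrightarrow{R(g \cap Z_1)}$. By definition, there exists
$v_1 \in Z_1$ such that $v_1 \models g$ and $v' = [R]v_1 + \delta$ for some
$\delta \geq 0$. Hence we have a run
$(q_1, v_1, \varepsilon) \xrightarrow{t}\xrightarrow{\delta} (q', v', \varepsilon)$.
Since the invariant holds before the internal rule, there is a run
$(q, v, \varepsilon) \xrightarrow{*} (q_1, v'_1, \varepsilon)$ with $v \in Z$
and $v_1 \preceq_{q_1} v'_1$. Now since $\preceq$ is a simulation we obtain
that
$(q_1, v'_1, \varepsilon) \xrightarrow{t}\xrightarrow{\delta} (q', v'', \varepsilon)$
with $v' \preceq_{q'} v''$ and we are done.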


Push Rule. Let $(q, Z) \in \mathfrak{S}$, $(q', Z') \in S_{(q,Z)}$ and
$v' \in Z'$. If $(q, Z) \in \mathfrak{S}^-$ then we get the result since (Inv)
holds before applying the Push rule. Otherwise, we must have
$(q', Z') = (q, Z)$ and we can choose the empty run from
$(q, v', \varepsilon)$ to itself.


Fig. 5. Construction for the soundness.


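Pop Rule. Let $(q, Z) \in \mathfrak{S} = \mathfrak{S}^-$,
$(q_2, Z_2) \in S_{(q,Z)}$ and $v' \in Z_2$. Again, if
$(q_2, Z_2) \in S^-_{(q,Z)}$ then we get the result since (Inv) holds before
applying the Pop rule. Otherwise, by definition of the pop rule we have:

1. some $(q', Z') \in S^-_{(q,Z)}$,
2. some push transition $t = q' \xrightarrow{g,\,push_a,\,R} q''$,
3. some $(q'', Z_1) \in \mathfrak{S}$ with $Z_1 \sim_{q''} Z'' = \overrightarrow{R(g \cap Z')}$,
4. some $(q'_1, Z'_1) \in S^-_{(q'',Z_1)}$,
5. some pop transition $t_1 = q'_1 \xrightarrow{g_1,\,pop_a,\,R_1} q_2$,

with $Z_2 = \overrightarrow{R_1(g_1 \cap Z'_1)}$. The construction below is
illustrated in Fig. 5.

Since $v' \in Z_2$, we get some $v_4 \in Z'_1$ such that $v_4 \models g_1$ and
$v' = [R_1]v_4 + \delta'$ for some $\delta' \geq 0$. Hence we have a run
$(q'_1, v_4, a) \xrightarrow{t_1}\xrightarrow{\delta'} (q_2, v', \varepsilon)$.

Now, applying the invariant to $(q'', Z_1) \in \mathfrak{S}$,
$(q'_1, Z'_1) \in S_{(q'',Z_1)}$ and $v_4 \in Z'_1$, we get a run
$(q'', v_3, \varepsilon) \xrightarrow{*} (q'_1, v'_4, \varepsilon)$ with
$v_3 \in Z_1$ and $v_4 \preceq_{q'_1} v'_4$. Hence, we also have a run
$(q'', v_3, a) \xrightarrow{*} (q'_1, v'_4, a)$.

Let $v'_3 \in Z'' = \overrightarrow{R(g \cap Z')} \sim_{q''} Z_1$ with
$v_3 \preceq_{q''} v'_3$. We get some $v_2 \in Z'$ such that $v_2 \models g$
and $v'_3 = [R]v_2 + \delta$ for some $\delta \geq 0$. Hence we have a run
$(q', v_2, \varepsilon) \xrightarrow{t}\xrightarrow{\delta} (q'', v'_3, a)$.

Finally, we apply the invariant to $(q, Z) \in \mathfrak{S}$,
$(q', Z') \in S_{(q,Z)}$ and $v_2 \in Z'$: we get a run
$(q, v, \varepsilon) \xrightarrow{*} (q', v'_2, \varepsilon)$ with $v \in Z$
and $v_2 \preceq_{q'} v'_2$.

By repeatedly applying the property of the simulation $\preceq$, we may extend
the run from $(q', v'_2, \varepsilon)$ with
$(q', v'_2, \varepsilon) \xrightarrow{t}\xrightarrow{\delta} (q'', v''_3, a) \xrightarrow{*} (q'_1, v''_4, a) \xrightarrow{t_1}\xrightarrow{\delta'} (q_2, v'', \varepsilon)$
where $v_3 \preceq_{q''} v'_3 \preceq_{q''} v''_3$ and
$v_4 \preceq_{q'_1} v'_4 \preceq_{q'_1} v''_4$. Finally
$v' \preceq_{q_2} v''$. Therefore, the invariant holds after the pop rule.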


5 Algorithm for PDTA Reachability via Zones 


In this section, we describe Algorithm 1 implementing the fixed point
computation defined by the inductive rules in Table 1. We describe the structure
of the algorithm and its main data structures.

Notice first that the sets $\mathfrak{S}$ and $S_{(q,Z)}$ for
$(q, Z) \in \mathfrak{S}$ can be alternatively represented as a single set of
pairs of nodes:


$$S = \{[(q, Z), (q', Z')] \mid (q, Z) \in \mathfrak{S} \text{ and } (q', Z') \in S_{(q,Z)}\}.$$


We can recover $\mathfrak{S}$ as the first projection of S and $S_{(q,Z)}$ as
the second projection of S filtered by the first component being $(q, Z)$. We
use both notations below depending on which is more convenient. The start rule
initializes S to $\{[(q_0, Z_0), (q_0, Z_0)]\}$.

Let us consider first the rule for internal transitions. For each already
discovered pair of nodes $[(q, Z), (q', Z')] \in S$ (or
$(q', Z') \in S_{(q,Z)}$ with $(q, Z) \in \mathfrak{S}$), we have to consider
each possible internal transition $q' \xrightarrow{g,\,nop,\,R} q''$ and check
whether the node $(q'', Z'')$ with $Z'' = \overrightarrow{R(g \cap Z')}$ should
be added to $S_{(q,Z)}$ or is subsumed by an existing node. This is like a
graph traversal. The set S stores the already discovered pairs of nodes, and we
will use a ToDo (unordered) list to store the newly discovered nodes from which
outgoing transitions should be considered. The ToDo list should also consist of
pairs $[(q, Z), (q', Z')]$ so that when a new node $(q'', Z'')$ is discovered
by an internal transition from $(q', Z')$ we know to which set $S_{(q,Z)}$ it
should be added.

As we can see from Theorem 1-soundness, given $(q, Z) \in \mathfrak{S}$, the
set $S_{(q,Z)}$ should consist of nodes reachable from $(q, Z)$ via a
well-nested run. Hence, when dealing with a pair $[(q, Z), (q', Z')] \in S$ and
we see a push transition $q' \xrightarrow{g,\,push_a,\,R} q''$ with
$Z'' = \overrightarrow{R(g \cap Z')}$, we should not try to add the node
$(q'', Z'')$ to $S_{(q,Z)}$ since the corresponding run would not be
well-nested. Instead, we should search for a matching pop transition which could
be taken after a well-nested run starting from $(q'', Z'')$. This is why the
push rule adds the new root $(q'', Z'')$ to $\mathfrak{S}$ (unless it is
equivalent to an existing root). The pair of nodes
$[(q'', Z''), (q'', Z'')]$ is newly discovered and added to the ToDo list for
further exploration.

The push transition may be matched with several pop transitions (which
could be already discovered or yet to be discovered by the algorithm). To avoid
revisiting the push transition many times, it will be stored by the algorithm
in an additional set $S_{push}$. More precisely, we will store in $S_{push}$
the tuple $[(q, Z), a, (q'', Z'')]$ meaning that the root node $(q'', Z'')$ may
be reached from the root node $(q, Z)$ via a well-nested run reaching some
$(q', Z')$ followed by a transition pushing a onto the stack.

Finally, assume that, when dealing with a pair
$[(q_1, Z_1), (q'_1, Z'_1)] \in S$, we see a pop transition
$q'_1 \xrightarrow{g_1,\,pop_a,\,R_1} q_2$ with
$Z_2 = \overrightarrow{R_1(g_1 \cap Z'_1)}$. We will check whether it can be
matched with an already visited push transition, stored in the set $S_{push}$
as a tuple $[(q, Z), a, (q'', Z'')]$ with $(q'', Z'') = (q_1, Z_1)$. If this is
the case, the pop rule may be applied and the node $(q_2, Z_2)$ added to
$S_{(q,Z)}$ (unless it is subsumed by an existing node). The newly discovered
pair of nodes $[(q, Z), (q_2, Z_2)]$ is also added to the ToDo list for further
exploration. Once again, the pop transition may also be matched with push
transitions that will be discovered later by the algorithm. To avoid revisiting
the pop transition many times, we store the tuple $[(q_1, Z_1), a, (q_2, Z_2)]$
in a new set $S_{pop}$.

Data Structures. We use a data structure TLM to store the triple of sets
$(S, S_{push}, S_{pop})$, which is accessed with the following methods.


— TLM.create() creates the data structure with the three sets empty.

— TLM.add(q, Z, q', Z') adds $[(q, Z), (q', Z')]$ to S.

— TLM.addPush(q, Z, a, q', Z') adds $[(q, Z), a, (q', Z')]$ to $S_{push}$.

— TLM.addPop(q, Z, a, q', Z') adds $[(q, Z), a, (q', Z')]$ to $S_{pop}$.

— TLM.isNewRoot(q, Z) returns [false, Z'] if there exists some
$(q, Z') \in \mathfrak{S}$ with $Z' \sim_q Z$, and returns [true, Z] otherwise.

— TLM.isNewNode(q, Z, q', Z') returns false if
$\exists [(q, Z), (q', Z'')] \in S$ with $Z' \preceq_{q'} Z''$, and returns true
otherwise.

— TLM.isNewPop(q, Z, a, q', Z') returns false if
$\exists [(q, Z), a, (q', Z'')] \in S_{pop}$ with $Z' \preceq_{q'} Z''$, and
returns true otherwise.

— TLM.isNewPush(q, Z, a, q', Z') returns false if
$[(q, Z), a, (q', Z')] \in S_{push}$, and returns true otherwise.

— TLM.iterPop(q, Z, a) returns the list of $(q', Z')$ with
$[(q, Z), a, (q', Z')] \in S_{pop}$.

— TLM.iterPush(a, q', Z') returns the list of $(q, Z)$ such that
$[(q, Z), a, (q', Z')] \in S_{push}$.


Concretely, the data structure should store sets of nodes (q, Z) and be able 
to search or iterate through such sets. In order to make the algorithm slightly 
faster, we will segregate our sets of nodes, with the name of the state. We will 
use a hashmap in order to accomplish this task. See Fig.6 where the concrete 
data structure is depicted. 

We will use a first level hashmap to store the set of roots $\mathfrak{S}$. To
implement TLM.isNewNode(q, Z, q', Z'), we first search for $(q, Z)$ in the first
level map, then a pointer TLM[q][Z][0] will lead to a second level hashmap for
the set of nodes $S_{(q,Z)}$ and we search for $(q', Z')$ in this second level
map. See Fig. 6(b).
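
The two-level lookup for roots and nodes can be sketched as follows. This is a simplified illustration and not the data structure of the implementation: the class and method names in snake case are chosen for the example, zones are abstract values compared through a user-supplied predicate `simulated(q, Z1, Z2)` (an assumption standing for the simulation test), and the push and pop maps are omitted.

```python
from collections import defaultdict


class TLM:
    """Two-level map restricted to the set S of pairs [(q, Z), (q1, Z1)]."""

    def __init__(self, simulated):
        self.simulated = simulated
        # First level: state name -> list of entries [root zone, second-level map].
        # Second level: state name -> list of zones stored for that root.
        self.roots = defaultdict(list)

    def _find_root(self, q, Z):
        for entry in self.roots[q]:
            if entry[0] == Z:
                return entry
        return None

    def is_new_root(self, q, Z):
        """Return (False, Z1) if an equivalent root (q, Z1) exists, else (True, Z)."""
        sim = self.simulated
        for root_zone, _ in self.roots[q]:
            if sim(q, root_zone, Z) and sim(q, Z, root_zone):
                return False, root_zone
        return True, Z

    def is_new_node(self, q, Z, q1, Z1):
        """Return False if some stored node (q1, Z2) of root (q, Z) subsumes Z1."""
        entry = self._find_root(q, Z)
        if entry is None:
            return True
        return not any(self.simulated(q1, Z1, Z2) for Z2 in entry[1][q1])

    def add(self, q, Z, q1, Z1):
        """Add the pair [(q, Z), (q1, Z1)], creating the root entry if needed."""
        entry = self._find_root(q, Z)
        if entry is None:
            entry = [Z, defaultdict(list)]
            self.roots[q].append(entry)
        entry[1][q1].append(Z1)
```

Segregating both levels by state name means that a simulation or equivalence test is only ever attempted between zones attached to the same state.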

To implement TLM.isNewPop(q, Z, a, q', Z') and TLM.iterPop(q, Z, a), we first
search the root node $(q, Z)$ in the first level map, then a pointer
TLM[q][Z][2] will lead to a second level hashmap storing the set of triples
$(a, q', Z')$ such that $[(q, Z), a, (q', Z')] \in S_{pop}$. To speed up the
access, this second level pop map is segregated first on the key a, then on the
key q' to get the list of corresponding zones Z'. See Fig. 6(c,d).

Finally, we also store the set $S_{push}$ to implement
TLM.isNewPush(q, Z, a, q', Z') and TLM.iterPush(a, q', Z'). Notice that
$S_{push}$ consists of triples $[(q, Z), a, (q', Z')]$ where both $(q, Z)$ and
$(q', Z')$ are root nodes from $\mathfrak{S}$. Notice also that for the
iteration we fix the second node $(q', Z')$. To get an efficient
implementation, we first search the root node $(q', Z')$ in the first level
map, then a pointer TLM[q'][Z'][1] will lead to a second level hashmap storing
the set of triples $(a, q, Z)$ such that $[(q, Z), a, (q', Z')] \in S_{push}$.
To speed up the access, this second level push map is segregated first on the
key a, then on the key q to get the list of corresponding zones Z. See
Fig. 6(c,d).


Algorithm 1. PDTA Reachability Using Zones.

 1: procedure PDTAREACH
 2:   TLM.create()
 3:   TLM.add(q0, Z0, q0, Z0)                              ▷ Start Rule
 4:   ToDo = {[(q0, Z0), (q0, Z0)]}
 5:   while ToDo ≠ ∅ do
 6:     [(q, Z), (q', Z')] = ToDo.get()                    ▷ $(q, Z) \in \mathfrak{S} \wedge (q', Z') \in S_{(q,Z)}$
 7:     for $t = q' \xrightarrow{g,\,op,\,R} q''$ and $Z'' = \overrightarrow{R(g \cap Z')} \neq \emptyset$ do
 8:       if op = nop ∧ TLM.isNewNode(q, Z, q'', Z'') then
 9:         TLM.add(q, Z, q'', Z'')                        ▷ Internal Rule
10:         ToDo.add([(q, Z), (q'', Z'')])
11:       else if op = push_a then
12:         [isNew, Z1] = TLM.isNewRoot(q'', Z'')
13:         if isNew == true then
14:           TLM.add(q'', Z'', q'', Z'')                  ▷ Push Rule
15:           ToDo.add([(q'', Z''), (q'', Z'')])
16:         end if
17:         if TLM.isNewPush(q, Z, a, q'', Z1) then
18:           TLM.addPush(q, Z, a, q'', Z1)
19:           for (q2, Z2) in TLM.iterPop(q'', Z1, a) do
20:             if TLM.isNewNode(q, Z, q2, Z2) then
21:               TLM.add(q, Z, q2, Z2)                    ▷ Pop Rule
22:               ToDo.add([(q, Z), (q2, Z2)])
23:             end if
24:           end for
25:         end if
26:       else if op = pop_a then
27:         if TLM.isNewPop(q, Z, a, q'', Z'') then
28:           TLM.addPop(q, Z, a, q'', Z'')
29:           for (q3, Z3) in TLM.iterPush(a, q, Z) do
30:             if TLM.isNewNode(q3, Z3, q'', Z'') then
31:               TLM.add(q3, Z3, q'', Z'')                ▷ Pop Rule with q = q3, Z = Z3
32:               ToDo.add([(q3, Z3), (q'', Z'')])         ▷ q2 = q'', Z2 = Z''
33:             end if
34:           end for
35:         end if
36:       end if
37:     end for
38:   end while
39: end procedure


Fig. 6. Two level map implementing the data structure TLM storing the sets S,
$S_{push}$, $S_{pop}$.
(a) First level map constructed using equivalence $\sim_q$ for controlling size.
Keys will be state names, values will be lists of quadruplets, each of which has
four pointers to second level maps, second level pushes maps, second level pops
maps, and zones.
(b) Second level map corresponding to $S_{(q_1,Z_1)}$. Each first level map node
will have its own second level map.
(c) Pushes/Pops map corresponding to root node $(q_1, Z_1)$. Each pointer points
to a different map where $(q, Z)$ are stored.
(d) For pushes/pops map, this is a map corresponding to root node $(q_1, Z_1)$
and a fixed stack symbol. The $(q, Z)$ stored here is constructed using
equivalence (pushes map), or using simulation (pops map).


We now show correctness of Algorithm 1. Note that TLM encodes a triple of
sets $(S, S_{push}, S_{pop})$ defined by:

$$S = \{[(q, Z), (q', Z')] \mid (q', Z') \in \mathrm{TLM}[q][Z][0]\}$$
$$S_{push} = \{[(q, Z), a, (q', Z')] \mid (a, q, Z) \in \mathrm{TLM}[q'][Z'][1]\}$$
$$S_{pop} = \{[(q, Z), a, (q', Z')] \mid (a, q', Z') \in \mathrm{TLM}[q][Z][2]\}$$

Recall also the correspondence explained at the beginning of Sect. 5 between a
set S of pairs of nodes, and the set of roots $\mathfrak{S}$ together with the
sets of nodes $S_{(q,Z)}$ for $(q, Z) \in \mathfrak{S}$.
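
The control flow of Algorithm 1 can be sketched over abstract zones as follows. This is a minimal sketch under stated assumptions and not the authors' implementation: `pdta_reach`, `post` and `simulated` are hypothetical names, zone operations are hidden behind the user-supplied `post`, plain dictionaries and lists replace the two-level map, and the toy instance at the end is untimed (a single dummy zone, with equality playing the role of the simulation).

```python
from collections import deque


def pdta_reach(init, transitions, post, simulated):
    """Fixed point of the rules of Table 1 over abstract zones.

    init        -- initial node (q0, Z0); zones must be hashable
    transitions -- dict: state -> list of (op, symbol, label, target)
    post        -- post(label, Z): successor zone, or None if it is empty
    simulated   -- simulated(q, Z1, Z2): True iff Z1 is simulated by Z2 at q
    Returns a dict mapping each root to the list of nodes stored for it.
    """
    S = {init: [init]}   # root -> nodes reachable by a well-nested run
    pushes = {}          # (symbol, new root) -> roots from which it was pushed
    pops = {init: []}    # root -> list of (symbol, node reached by a pop)
    todo = deque([(init, init)])

    def add_node(root, node):
        q, z = node
        if any(p == q and simulated(q, z, z1) for p, z1 in S[root]):
            return                       # subsumed by an existing node
        S[root].append(node)
        todo.append((root, node))

    while todo:
        root, (q1, z1) = todo.popleft()
        for op, a, label, q2 in transitions.get(q1, []):
            z2 = post(label, z1)
            if z2 is None:
                continue
            if op == "nop":              # Internal rule
                add_node(root, (q2, z2))
            elif op == "push":           # Push rule: reuse an equivalent root
                new_root = next(
                    ((p, z) for p, z in S
                     if p == q2 and simulated(q2, z, z2) and simulated(q2, z2, z)),
                    None,
                )
                if new_root is None:
                    new_root = (q2, z2)
                    S[new_root] = [new_root]
                    pops[new_root] = []
                    todo.append((new_root, new_root))
                sources = pushes.setdefault((a, new_root), [])
                if root not in sources:
                    sources.append(root)
                    for b, node in list(pops[new_root]):
                        if b == a:       # Pop rule, pop seen before the push
                            add_node(root, node)
            elif op == "pop":
                known = any(b == a and n[0] == q2 and simulated(q2, z2, n[1])
                            for b, n in pops[root])
                if not known:
                    pops[root].append((a, (q2, z2)))
                    for source in list(pushes.get((a, root), [])):
                        add_node(source, (q2, z2))   # Pop rule
    return S


# Toy untimed instance: a single dummy zone 0 and equality as simulation.
toy = {"q0": [("push", "a", None, "q1"), ("pop", "b", None, "q3")],
       "q1": [("pop", "a", None, "q2")]}
result = pdta_reach(("q0", 0), toy, lambda label, z: z, lambda q, z1, z2: z1 == z2)
reachable = {q for q, _ in result[("q0", 0)]}
```

In the toy instance, `q2` ends up in the set of the initial root because the push of a is matched by a pop of a, whereas `q3` does not, since its pop has no matching push.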


Theorem 2. The set S encoded by TLM computed by Algorithm 1 is a fixed 
point obtained starting from the empty set by applying the inductive rules in 
Table 1. Therefore, Algorithm 1 terminates and is sound and complete for well- 
nested control state reachability of pushdown timed automata. 


Proof (sketch). 


1. For termination, if we look at our algorithm, we can clearly see that before 
adding a pair of nodes to the ToDo list, we add the same pair to S with 
TLM.add, and before that, we always check whether the pair is already in our 
TLM or not (isNewNode or isNewRoot). Since the size of the TLM is always 
bounded because we check either the first level map or the second level map 
before adding, the outer while loop will be called only a finite number of 
times. From this we can conclude that the algorithm will terminate. 

2. For soundness we have to prove that any change to the TLM is equivalent to
applying one of the rewrite rules to (S, Spush, Spop), which is already known
to be sound from Theorem 1. The changes to the TLM occur in lines 3, 9, 14,
21, 31. Since line 3 simply adds [(q0, Z0), (q0, Z0)] to S, it simulates the start
rule. For line 9, we can see that the preconditions of the internal rule of Table 1
are met, with (q, Z) ∈ 𝔖, (q', Z') ∈ S_(q,Z), the if statement (just above the
line) stating that there is a nop transition from q' to q'', and Z'' ≠ ∅. Using
all these we can see that indeed the operation can be performed. Similar
arguments can be made for line 14, which simulates the push rule, and for lines
21 and 31, which both simulate the pop rule.

3. For completeness we have to prove that after termination of the algorithm,
using (S, Spush, Spop) to encode the TLM, we cannot use any of the rules
mentioned in Table 1 to add anything extra to the TLM. We can then conclude
from the completeness part of Theorem 1. For the start rule we can simply say
that it was definitely executed (line 3), so it cannot be executed again. For the
internal rule we argue that if it can be applied after termination, then it should
have been applied during execution. Since all transitions are considered in
the for-loop, and the conditions before line 9 check all the preconditions of
the internal rule, it is certainly the case that a node (q'', Z'') could not be
added because either it was already added, or there was some
(q'', Z''') ∈ S_(q,Z) with Z'' ⪯_{q''} Z'''.
The argument for the push rule is similar. For the pop rule to be applied we
argue that there must be a push transition and a pop transition satisfying
the preconditions in the pop rule. Since both of these are already present
for zones in the TLM, we say that they must have been added to Spush and
Spop. We then concern ourselves with the order, arguing that if the push
transition was discovered later, the node either must already have been added
(line 21) or another node simulating the node must have been present in the
TLM (line 20). A similar argument is made in case the order of discovery is
reversed.


For the full proof details, we refer the reader to the long version [6]. 
6 Experiments and Results


Implementation. We build on the existing architecture of an open-source tool for
analysis of timed automata, TChecker [19]. Our tool, along with the benchmarks
we used, is available at https://github.com/karthik-314/PDTA_reachability and
more details can be found in [6]. The inputs to our implementation are PDTA
rather than TA, so we modify TChecker in order to run our experiments. Most of
the TChecker file format remains the same; the only change to the syntax of the
input is in the edges. TChecker uses the following format for its transitions:


edge: <Process>:<src>:<tgt>:<label>{
  do:<Reset1(x=0)> ; <Reset2(y=0)> :
  provided: <guard1(x==0)> && <guard2(y>=1)>}


The new format, which incorporates the pushes and pops, is:


edge: <Process>:<src>:<tgt>:<label>{
  do:<Reset1(x=0)> ; <Reset2(y=0)> :
  provided: <guard1(x==0)> && <guard2(y>=1)>}
  [<push/pop>:<symbol>]


In case the operation is nop, the square brackets are left empty. 
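
As an illustration of the extended syntax, the bracketed suffix can be read back with a small regular expression. This is a minimal sketch and not the parser used in the tool: the helper name `parse_stack_op` and the concrete edge lines in the usage note are hypothetical, and the sketch assumes that stack symbols are plain identifiers.

```python
import re

# Matches the trailing suffix: "[push:a]", "[pop:a]", or "[]" for nop.
_STACK_OP = re.compile(r"\[\s*(?:(push|pop)\s*:\s*(\w+))?\s*\]\s*$")

def parse_stack_op(edge_line):
    """Return (operation, symbol) read from the trailing [...] of an edge line."""
    match = _STACK_OP.search(edge_line)
    if match is None:
        raise ValueError("edge line has no [...] stack-operation suffix")
    op, symbol = match.groups()
    # Empty brackets encode a nop transition, which carries no symbol.
    return ("nop", None) if op is None else (op, symbol)
```

For example, under these assumptions `parse_stack_op("edge:P:q0:q1:a{do:x=0 : provided:x==0} [push:alpha]")` yields `("push", "alpha")`, while the same edge ending in `[]` yields `("nop", None)`.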

We have implemented two variants of Algorithm 1 for PDTA, and we
compare them with each other and also with a region-based approach. More
precisely, we consider the following 3 algorithms:


— Simulation Based Approach (⪯_LU): Direct implementation of Algorithm 1.

— Equivalence Based Approach (∼_LU): This is a variation of Algorithm 1,
with two methods changed:

• TLM.isNewNode(q, Z, q', Z'): Returns false if ∃[(q, Z), (q', Z'')] ∈ S with
Z' ∼_{q'} Z'', and true otherwise.
• TLM.isNewPop(q, Z, a, q', Z'): Returns false if ∃[(q, Z), a, (q', Z'')] ∈ Spop
with Z' ∼_{q'} Z'', and true otherwise.

As mentioned in Sect. 4, if instead of simulation we use equivalence
everywhere, we still obtain a correct algorithm for reachability in PDTA. Hence
it is interesting to compare it with the above approach.

— Region Based Implementation (RB): A previous implementation [5]
uses a region-based approach in order to solve the non-emptiness problem in
PDTA. We note two features of the algorithm. First, it uses a tree-automaton
based approach for efficiency and correctness, but underlying it is the region
(rather than zone) construction. Second, it works only with closed guards,
while our approach works with closed and open guards.
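
The difference between the first two variants can be illustrated with a toy model. The sketch below is not the ⪯_LU simulation: it assumes one-clock zones represented as closed intervals, uses interval inclusion as a stand-in for simulation and interval equality as a stand-in for equivalence, and the helper names are hypothetical.

```python
def included(z1, z2):
    """Stand-in for Z1 simulated by Z2: interval z1 lies inside interval z2."""
    return z2[0] <= z1[0] and z1[1] <= z2[1]

def equal(z1, z2):
    """Stand-in for equivalence: inclusion holds in both directions."""
    return included(z1, z2) and included(z2, z1)

def store_all(discovered, covers):
    """Keep a node only if no stored node with the same state covers it."""
    stored = []
    for state, zone in discovered:
        is_new = not any(s == state and covers(zone, z) for s, z in stored)
        if is_new:
            stored.append((state, zone))
    return stored

# Nodes in the order in which a hypothetical exploration discovers them.
discovered = [("q", (0, 10)), ("q", (2, 5)), ("q", (0, 10)),
              ("q", (3, 4)), ("p", (2, 5))]
with_sim = store_all(discovered, included)
with_equiv = store_all(discovered, equal)
```

In this toy run the simulation-style check stores 2 nodes and the equivalence-style check stores 4, which mirrors why the simulation-based variant is expected to explore fewer nodes.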


We note the following important points regarding our implementation: 


1. The simulation ⪯ used in our implementation is ⪯_LU [8], without extrapolation
and with global clock bounds.

2. The ToDo list currently uses LIFO (stack) ordering for popping elements.
This corresponds to a DFS exploration of the zone graph. Other data structures
can be used for this purpose as well; e.g., changing it to FIFO
would give us a BFS exploration.

3. Both the simulation based and equivalence based approach are tested on 
PDTA with empty and non-empty languages, but we have ensured that both 
of them return an answer only after the entire exploration has been completed. 
In other words, we do not stop the exploration when we reach a final state. 
This is to make fair comparisons, where we do not terminate because of being 
“lucky” in encountering the final state early. Of course in practice we would 
not do this. In contrast, we note that the RB approach is an on the fly 
approach which returns non-empty as soon as the final state turns out to be 
reachable. 
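
The effect of the ToDo ordering mentioned in point 2 can be sketched on a plain graph. This is a generic worklist loop under simplifying assumptions (an explicit finite graph, no zones or stack); the function `explore` and the example graph are hypothetical and only illustrate how LIFO and FIFO removal change the exploration order.

```python
from collections import deque

def explore(graph, start, lifo=True):
    """Worklist exploration; returns nodes in the order they are processed."""
    todo = deque([start])
    seen = {start}  # nodes are marked when inserted, before being processed
    order = []
    while todo:
        # LIFO removal gives a depth-first order, FIFO a breadth-first order.
        node = todo.pop() if lifo else todo.popleft()
        order.append(node)
        for succ in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                todo.append(succ)
    return order

graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}
dfs_order = explore(graph, "a", lifo=True)
bfs_order = explore(graph, "a", lifo=False)
```

Both orders visit the same set of nodes, since the `seen` check is independent of the removal policy; only the sequence differs.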


All experiments were run on an Intel i5 10th-generation processor with 8 GB of
RAM, with a timeout of 120 seconds.


Benchmarks. We used a total of 10 benchmarks in our experiments, but
parameterized several of them in order to test the scalability and to give us more insight
into performance comparisons. The benchmarks and their parameterizations are
explained in [6]. We highlight only some salient points here. The benchmark B1
is the PDTA from Fig. 1. B2(k) is directly adapted from Fig. 3 with the constant
y < 1 parametrized to y < k, and k + 1 pops between q0 and q2. Note that
q3 is unreachable regardless of the value of k. Benchmarks B3, B4 are adapted
from [5] with B3 involving untiming of a stack age into normal clocks. B5, B6
involve significant interplay of push/pop edges and clocks and B6, B7 also have
open guards. More details can be found in [6]. We also note that automata By, 
B3(3,4), Bs (ki, k2), Bg, Bo(ki, k2) accept a nonempty language, while the rest 
are empty. As described earlier this does not change the performance of the 
simulation and equivalence based approaches, but may significantly change the 
performance of the Region Based Approach. 


Results. Table 2 contains a selection of our experimental results; more can be
found in [6]. From the table, we conclude first that the zone-based approach
is indeed faster than the region-based approach for all examples. Second, the
simulation-based approach runs faster than the equivalence-based approach for
all examples if the ToDo priority for removal remains the same. In fact, the
performance of the simulation-based approach depends mostly on the size of the
PDTA, but the equivalence-based approach is dependent on the constants used
in guards as well, which is even more the case for the region-based approach.
Finally, our approach can easily handle closed and open guards.

Most of the timeouts that occurred during the experiments are due to Out of 
Memory (OoM) kills, especially for larger sized PDTAs. For smaller sized PDTA 
such as B2(100), the recorded number of nodes before timeout was 154700. 

Regarding the performance, we would like to emphasize that B1, B2, B3, B4, 
B7 were designed to compare the Zone approach to the region (RB) approach. 
Table 2. Results on the Benchmarks. Time recorded in ms, and timeout (T.O.) used
is 120s. OoM stands for Out of Memory kill. Results rounded up to 1 decimal. # nodes
refers to the number of nodes in the zone/region graph explored. In case of timeout,
> n refers to the recorded number of nodes n before the timeout occurred. NA in RB columns
represents that the region based approach does not handle open guards in transitions
(B6, B7 have open guards.)

Benchmark          | ⪯_LU Time | ⪯_LU # nodes | ∼_LU Time | ∼_LU # nodes | RB Time | RB # nodes
B1                 |       0.2 |           17 |       0.2 |           17 |   235.6 | 4100
B2(10)             |       0.8 |           77 |       0.8 |           77 |  6835.8 | 30200
B2(100)            |      20.0 |         5252 |      20.7 |         5252 |    T.O. | > 154700
B3(4, 3)           |       0.2 |            6 |       0.2 |            6 |  1043.8 | 14300
B3(3, 4)           |       0.2 |            9 |       0.2 |            9 |    98.8 | 3400
B4                 |       0.2 |            8 |       0.1 |            8 |     0.3 | 17
B5(100, 10)        |       0.8 |          202 |       5.4 |         2212 |     OoM | OoM
B5(100, 1000)      |       0.7 |          202 |    3564.3 |       201202 |     OoM | OoM
B5(5000, 100)      |      23.2 |        10002 |    3429.3 |      1010102 |     OoM | OoM
B6(5, 4, 1000)     |       0.3 |           30 |     611.8 |        30047 |      NA | NA
B6(5, 4, 10000)    |       0.3 |           30 |   60271.9 |       300047 |      NA | NA
B6(501, 500, 100)  |      38.2 |         3006 |     501.0 |        34799 |      NA | NA
B7                 |     112.4 |         4475 |     113.1 |         4475 |      NA | NA

As a consequence these models are very simple and the number of explored nodes
remains almost the same regardless of whether we use ∼ or ⪯ to prune, which
reflects in the times/sizes not being too different. However, the other examples
B5, B6 are more complex and have nodes that get pruned during exploration
(both using ∼ and ⪯). Here we can see the clear improvement of ⪯ over ∼ both
in terms of time taken and also of number of explored nodes.


7 Discussion and Future Work 


In this paper, we examined how an unbounded stack can be integrated seamlessly 
with zone-abstractions in timed automata. We would like to point out that two 
easy extensions of our work are possible. First, as remarked earlier, our algorithm
checks for well-nested reachability, i.e., it requires reaching a final state with
an empty stack for acceptance. But we can generalize this to general control-state
reachability by showing that a control state q is reachable in the PDTA (with
possibly a non-empty stack) iff some node (q, Z) is discovered by our algorithm
and added to some S_(q',Z') (and not just to S_(q0,Z0) as in the well-nested case).
While this idea is simple and requires only minor edits to the existing algorithm, 
the proof of correctness requires more work and we leave this for future work. 
Secondly, we can handle the model with ages in the stack as in [1,3] with an
exponential blowup (thanks to [13]). However, an open question is whether this 
blowup can be avoided in practice. As noted earlier, there exist extensions [14, 15] 
studied especially in the context of binary reachability, which are expressively 
strictly more powerful, for which decidability results are known. It would be 
interesting to see how we can extend the zone-based approach to those models. 

Finally, it seems interesting to examine further the link to the liveness prob- 
lem, possibly allowing us to transfer ideas and obtain faster implementations. 
Another possibility would be to use the extrapolation operator (rather than, or 
in addition to, simulation), which we have not considered in this work. 
Abstract. We have completed machine-assisted proofs of two highly-optimized
cryptographic primitives, AES-256-GCM and SHA-384. We
have verified that the implementations of these primitives, written in a 
mix of C and x86 assembly, are memory safe and functionally correct, 
by which we mean input-output equivalent to their algorithmic specifica- 
tions. Our proofs were completed using SAW, a bounded cryptographic 
verification tool which we have extended to handle embedded x86. The 
code we have verified comes from AWS LibCrypto. This code is identical 
to BoringSSL and very similar to OpenSSL, from which it ultimately 
derives. We believe we are the first to formally verify these implementa- 
tions, which protect the security of nearly everybody on the internet. 


Keywords: Cryptography · Automated reasoning · Verification


1 Introduction 


Widely-used cryptographic libraries such as OpenSSL [20], BoringSSL [16], and 
AWS LibCrypto [2] are an enticing target for formal verification. These libraries 
are used, to a first approximation, by everybody—or at least the four billion 
or so worldwide users of the internet. Each primitive in these libraries typically 
consists of a modest amount of code, but these primitives loom large in both their 
security and performance impact. Cryptographic primitives are also unusual in 
that they have clearly defined specifications and very few dependencies, which 
removes some major challenges from general-purpose verification. As a result, in 
recent years many efforts have been made to verify cryptographic library code. 

However, despite significant progress, widely-used cryptographic libraries 
have resisted verification, at least for the versions of the primitives that are 
used in practice. This is because these primitives are some of the most heavily 
optimized pieces of code in existence. For a cloud service, every packet involves 
a call to at least one cryptographic primitive, so even small optimizations will 
have large performance and cost impacts. As a result, for AES and SHA there 
is an enormous gap in complexity between simple and easily verified high-level
reference implementations, and the highly optimized implementations used in 
production. 

Optimizations create several difficulties when verifying cryptographic prim- 
itives. First, primitives are typically written in a mix of C and assembly. This 
means that a verification tool must model both of these languages and the man- 
ner in which they can interact. Furthermore, each optimization step inherently 
increases the difficulty of verification, because each requires one or more the- 
orems showing that the optimization is sound. To add to this, many of these 
optimizations break the abstractions used in algorithm specifications. For
example, the SHA-384 specification is defined using a function called Sigma0 that
is unfolded and rearranged during the optimisation process (see Subsect. 6.1). 
Solver-based automation typically struggles to recover these abstractions. 
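
To make the kind of rearrangement concrete, the sketch below compares Sigma0 as written in the SHA-512/384 specification (FIPS 180-4) with an algebraically equal form in which a common rotation is factored out. The factored form is chosen purely for illustration and is not claimed to be the one that appears in AWS-LC; the comparison is by random testing, which is evidence of equivalence rather than a proof.

```python
import random

MASK64 = (1 << 64) - 1

def rotr64(x, n):
    """Rotate a 64-bit word right by n bits (0 < n < 64)."""
    return ((x >> n) | (x << (64 - n))) & MASK64

def sigma0_spec(x):
    """Sigma0 as written in the SHA-512/384 specification."""
    return rotr64(x, 28) ^ rotr64(x, 34) ^ rotr64(x, 39)

def sigma0_rearranged(x):
    """Same function with the common rotation by 28 factored out."""
    return rotr64(x ^ rotr64(x, 6) ^ rotr64(x, 11), 28)

def agree_on_samples(trials=1000, seed=0):
    """Compare both forms on pseudo-random 64-bit words."""
    rng = random.Random(seed)
    words = (rng.getrandbits(64) for _ in range(trials))
    return all(sigma0_spec(w) == sigma0_rearranged(w) for w in words)
```

The two forms agree because rotation distributes over XOR and rotations compose additively (28 + 6 = 34, 28 + 11 = 39). A proof tool has to rediscover exactly this kind of identity when the optimized code no longer mentions Sigma0 as a unit.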

The verification of cryptographic code has seen huge advances in recent years. 
Purpose-built libraries such as EverCrypt [21] can now match the performance of 
hand-tuned OpenSSL. These correct-by-construction libraries may be the future, 
but as of 2021 they have not yet seen wide mainstream adoption. Our aim as 
formal methods practitioners is to verify the cryptographic code on which users 
depend. What has been missing until now is the ability to verify the legacy 
cryptographic code that runs in production for hundreds of millions of users. 
This is the problem we solve. 


Approach and Results. We have formally verified the memory safety and func- 
tional correctness of two key cryptographic primitives, AES-256-GCM and SHA- 
384 as they currently appear in the new AWS LibCrypto library (AWS-LC) [2]. 
AWS-LC is a general-purpose library maintained by Amazon Web Services for 
use with AWS applications. We targeted these algorithms in particular because 
they are used within AWS and included in the Commercial National Security 
Algorithms Suite [18]. We chose a block cipher and a hashing algorithm in order 
to cover multiple algorithm types and to be representative of other algorithms 
in AWS-LC. 

Cryptographic algorithms have fixed specifications which permit a narrow 
range of designs, and as a result, implementations change slowly. The AES- 
256-GCM and SHA-384 implementations in AWS-LC are identical to those in 
Google’s BoringSSL library, and as a result, our proofs apply to it as well. 
For these primitives, there are only small differences between BoringSSL and 
OpenSSL, and we are confident our proofs would also apply to OpenSSL with 
minor modifications. 

Our proofs show that the implementations of AES-256-GCM and SHA-384 
are input-output equivalent to formal specifications of their expected behaviour. 
We write our specifications in Cryptol [11], a pre-existing high-level language 
designed for use by cryptographic experts. Cryptol specifications are executable, 
so our proofs establish that for any input, the implementation and specification
produce exactly the same result. In addition, our proofs guarantee that the
code is free of undefined behaviour such as memory safety errors, meaning that
any remaining correctness errors are local to the code being proved and cannot
affect the calling context. We do not verify side-channel properties, nor do we 
analyse cryptographic security properties of the AES-256-GCM and SHA-384 
algorithms. 

We performed these proofs using the Software Analysis Workbench 
(SAW) [14]. SAW is an industrial verification tool designed to prove equiva- 
lence properties between abstract specifications and lower-level, more optimized 
implementations. SAW is a bounded verifier: loops must be verified under pre- 
conditions that guarantee termination, and data-structures must be statically 
allocated with bounded sizes. 

We have run our proofs on fixed sizes of input data, i.e., fixed numbers of 
bytes to be hashed/encrypted/decrypted. The number of loop iterations in these
algorithms is strictly fixed by the input size, so this also implicitly bounds the
execution length. We chose these sizes so as to exercise all branches and boundary 
conditions in the code and specification (in this, we follow Galois and AWS’s 
previous work: see Chudnov et al. [7]). We discuss the scope and limitations of 
our proof in Sect. 7. 

Each proof of a cryptographic primitive in SAW has two stages. In the first, 
the imperative input code is converted to a functional term using bounded sym- 
bolic execution. This depends on a high-fidelity model of the input languages. 
SAW already had an LLVM model used for C and C++ verification. For AES- 
256-GCM and SHA-384 we developed a new SAW model of x86 assembly, along 
with an interface with SAW’s existing LLVM model. As well as modeling core 
x86, this also included modeling special-purpose instructions used to achieve high 
performance. A successful conversion only occurs for well-defined programs, and 
implies that the program is free of undefined behavior under the given precon- 
ditions. 

In the second stage of a SAW proof, the symbolic term is compared to a 
specification term written in Cryptol. For many applications, SAW can discharge 
these equivalences automatically, but this is where the optimizations in AES-256- 
GCM and SHA-384 made verification much more challenging. The proof steps 
involved cannot be discharged automatically by current solvers, so instead, our 
proofs make careful use of rewriting logic to massage the terms into a form that 
can be discharged. Some of these proof steps may be amenable to automated 
solving in future. 

Our proofs were developed collaboratively by a team of expert verification
engineers. As well as technical innovation, these proofs also required careful
proof engineering. By this, we mean the analog of software engineering—a com- 
bination of proof design, tool design, and team working practices which makes it 
possible to execute effectively on a verification goal. We found that to a degree, 
proof engineering is software engineering; that is, successful proof engineering 
has similarities to the practices needed when developing a challenging software 
project. 

Aside from proofs and tool capabilities, there is something else notable about 
our project: we verify code that was never intended for formal verification. This 
is in contrast to many other efforts, which target systems that were designed with
assurance in mind. For example, Galois and AWS previously verified an Amazon 
TLS library that was purpose-built as a high-assurance alternative to OpenSSL’s 
TLS support [7], while in the EverCrypt library, code and proof were developed 
in parallel, and even the API was designed to simplify specifications [21]. We 
verify legacy code because this is the code that is actually used in AWS-LC and 
its predecessors. 


Contributions. The key contributions of this paper are as follows: 


— Proofs of correctness for highly optimized versions of AES-256-GCM and 
SHA-384, as they appear in AWS LibCrypto and BoringSSL. 

— A verifier for mixed C and x86 code which allows precise reasoning about 
functional correctness. This capability is built into the industrial verification 
tool, SAW. 

— A simple system of rewrite tactics which is powerful enough to allow verifi- 
cation of highly optimised cryptographic algorithms. 

— Lessons learned in proof engineering when applying an industry verification 
tool to a challenging piece of legacy cryptographic code. 


All proof scripts are available online¹.


1.1 Related Work 


There is a considerable amount of recent work in cryptographic verification, 
representing a large space of application domains and design requirements. While 
our work is widely applicable, we do not consider it a one-size-fits-all solution. We 
discuss how a developer might choose between the many verified cryptography 
efforts in Subsect.7.2. Here we give an overview of projects that target C or 
x86, or that are closely related technically. We do not review work on verifying 
cryptographic security properties, which is orthogonal to the problem of verifying 
that code matches algorithm. 

The closest work to ours in terms of technical approach is Galois and AWS’s 
previous work verifying the HMAC and DRBG primitives in the AWS s2n TLS 
library [7]. Just as we do, they use SAW to verify production cryptographic code. 
The main difference from our current project is the complexity of the primitives 
verified. The HMAC and DRBG primitives are inherently simpler algorithms, 
and are written in C, rather than x86. Furthermore, this code was designed 
for verification, unlike the OpenSSL-derived code we target. In earlier work, Ye 
et al. also verified C versions of HMAC and DRBG from OpenSSL using the 
foundational Verified Software Toolchain (VST) [22]. 

The Everest project has developed a verified C/x86 cryptographic library called
EverCrypt [5,10,21,23]. Recent results are extremely impressive, with perfor- 
mance comparable to highly optimised OpenSSL code. However, EverCrypt 


1 https://github.com/awslabs/aws-lc-verification.
represents a different philosophy from ours, where the library and proof are
co-designed, and in some cases code is synthesized. This approach looks towards
a future where such libraries replace hand-written libraries like AWS-LC, Bor- 
ingSSL, and OpenSSL. Our philosophy is complementary: we verify code as it 
currently exists while we wait for the future to arrive. 

EverCrypt also differs in that they use a proof-assistant style of reasoning 
more similar to Coq or Isabelle. The advantage of this is that proofs are very 
flexible—for example, they work for unbounded input sizes. However, the cost is 
that proofs are relatively more verbose. Proof size is hard to estimate in Ever- 
Crypt, because the proof and implementation are mixed, but the earlier Vale 
paper [10] suggests that EverCrypt’s proof of AES-GCM uses 2000 lines of proof 
library plus additional proof mixed in. In comparison, SAW is designed to auto- 
mate reasoning where possible, and the proof of AES-256-GCM implementation 
takes us less than 1000 lines of proof (including white-space and comments, for 
attempted apples-to-apples comparison). 

The CASM [17] project verifies x86-based cryptography taken from 
OpenSSL, including SHA-256 (we verify SHA-384). CASM’s toolchain is sim- 
ilar to ours, based on symbolic execution and SMT solvers. However, CASM 
only examines functions over message blocks, rather than the whole SHA-256 
algorithm. CASM also does not verify the most highly optimised versions of this 
algorithm. For example, it omits x86 EVP and vector operations, two of the 
main challenges. 

Fiat Crypto [9] is a related approach, although it does not apply to the algo- 
rithms proved in this paper. It foundationally generates portable C field arith- 
metic implementations from a high level specification. Code synthesized by Fiat 
Crypto has already been added to OpenSSL. Jasmin [1] is another foundational 
synthesis approach. It generates high-performance vectorized x86 implementa- 
tions. The Jasmin implementation of ChaCha20-Poly1305 outperforms similar 
hand-optimized implementations. We have not seen Jasmin implementations of 
SHA-2 or AES-GCM. 

SAW’s approach has some similarities to model checking, in that it is a 
bounded verification technique. However, proofs are based on symbolic execu- 
tion, that is, construction of logical terms representing the program denotation, 
and proofs are bounded on input buffer size, not program execution length per se.


2 Project Design Constraints 


Our objective in this project was to verify the cryptographic code which is
actually deployed, and to ensure it stays verified as it changes over time². To do
this, we used continuous reasoning, a term due to Peter O’Hearn [19]. In contin- 
uous reasoning, there is a tight connection between code, software engineering 
process, and verification tools. Several recent industry projects have success- 
fully used continuous reasoning practices. It was also important that our tools 


2 In fact, we do not expect AES-256-GCM and SHA-384 to change often in AWS-LC, 
but this work takes place in the context of a larger AWS-LC assurance project. 
maintain the existing institutional trust in the original codebase; this ruled out
whole-code replacements such as EverCrypt. This resulted in the following design 
constraints: 


— Proofs had to run on the executed code, rather than a model/abstraction. 
This was to minimize the trusted base, and ensure that our proofs stayed in 
sync with the code as it evolved. 

— Proofs had to run automatically with a low enough time budget to integrate 
with continuous integration checking. This ensures that errors are detected 
at the time code is changing, which increases the probability of fixes. 

— Proofs had to avoid modifications to the original source code, and instead 
exist as separate supporting files. Our experience is that teams are typically 
very reluctant to modify original source code, even with non-functional anno- 
tations. 

— The proof toolchain had to operate independently of the software build sys- 
tem. This was to avoid introducing untrusted tools into critical development 
pathways. 


These constraints led us to use the SAW tool as our basis for verification [14]. 
Our project can be seen as a follow on to Galois and AWS’s prior verification 
of AWS s2n which had many of the same design objectives [7]. Chudnov et al. 
showed that SAW can be used for continuous reasoning for a relatively simple 
piece of C cryptography. The difference in our current project is the inherent 
difficulty of verifying the code. 


3 AES-256-GCM and SHA-384 Proof Structure 


Conceptually, SAW’s approach to proof works as follows. The tool symbolically 
executes C and x86 code, resulting in a collection of functional terms. A term 
describes every program output mathematically as a function of program inputs. 
Once side conditions have been discharged, completion of symbolic execution also 
implies that the program is safe: that is, memory safety errors cannot occur. In 
the final step of the proof, these functional terms are compared to specifications 
using a solver to determine whether they are equivalent. 


Interfaces. At the top level of our proof, we verify the AWS-LC primitives against
OpenSSL’s EVP interface³. OpenSSL and its descendants use this interface to
make it easy to swap out algorithms without exposing their implementations. 
This complicates the verification task by hiding functions behind pointers and 
union types. It has also attempted to remain largely backwards compatible for 
years, resulting in an API that is not as clean as it might be otherwise. Perhaps 
for these reasons, previous cryptographic verification projects have not verified 
the EVP interface. 


3 https://wiki.openssl.org/index.php/EVP.
let EVP_CipherUpdate_spec enc gcm_len len = do {
  // ... some cipher set-up omitted (5 lines)

  cipher_data_ptr <- crucible_alloc_aligned 16 (llvm_struct "struct.EVP_AES_GCM_CTX");
  points_to_EVP_AES_GCM_CTX cipher_data_ptr ctx mres {{ 1 : [32] }} 0xffffffff;

  ctx_ptr <- crucible_alloc_readonly (llvm_struct "struct.evp_cipher_ctx_st");
  points_to_evp_cipher_ctx_st ctx_ptr cipher_ptr cipher_data_ptr enc;

  (in_, in_ptr) <- ptr_to_fresh_readonly "in" (llvm_array len (llvm_int 8));
  out_ptr <- crucible_alloc (llvm_array len (llvm_int 8));
  out_len_ptr <- crucible_alloc (llvm_int 32);

  crucible_execute_func [ctx_ptr, out_ptr, out_len_ptr, in_ptr,
                         (crucible_term {{ `len : [32] }})];

  let ctx' = {{ cipher_update enc ctx in_ }};
  // ... some cipher invariants omitted (3 lines)

  crucible_points_to out_ptr (crucible_term {{ ctr32_encrypt ctx in_ }});
  crucible_points_to out_len_ptr (crucible_term {{ `len : [32] }});
  crucible_return (crucible_term {{ 1 : [32] }});
};


Fig. 1. Part of the EVP interface for AES-256-GCM. 


SAW-Script Specifications. The top-level EVP specifications are defined in SAW- 
script, the high-level control language for SAW. Figure 1 shows part of the SAW- 
script EVP interface for AES-256-GCM. In its form, this interface consists of a 
series of instructions in SAW-script, but in its effect, it is a Hoare-style pre/post 
specification. The interface sets up symbolic memory (the pre-condition), sym- 
bolically executes the function (crucible_execute_func), and then checks that 
the resulting symbolic memory contains the correct values (the post-condition). 

For AES, the main purpose of the pre-condition is to define the layout of 
memory that results from the AES initialization function. Because we define
a post-condition for the initialization function that matches the specification given
here, we can end-to-end verify the common use case of initializing memory, 
encrypting some input, and returning the result. 

The script defines the memory pre- and post-conditions for the function using 
points-to assertions. In SAW-script, we allocate symbolic memory at specific sizes 
using the crucible_alloc commands. We can then use the points_to com- 
mand to specify that a pointer points to symbolic memory. The ptr_to_fresh 
command is a convenience function that allocates a pointer, and then initializes 
it with symbolic memory. 

SAW’s logic is less expressive than a full separation logic, but specifications 
can naturally be interpreted in terms of separation, including the property that 
memory cells do not overlap. To make the memory layout easier to understand, 
consider the following separation logic triple, which roughly corresponds to the 
layout defined in the SAW-script: 
{cipher_data_ptr ↦ ctx ... * in_ptr ↦ in * out_ptr ↦ (_ : [len])}
EVP_CipherUpdate(ctx_ptr, out_ptr, out_len_ptr, in_ptr, len)
{cipher_data_ptr ↦ cipher_update(ctx ...) *
 in_ptr ↦ in * out_ptr ↦ ctr32_encrypt(ctx, in)}


Rather than syntactically divide the pre-condition and post-condition, as 
in a Hoare triple, the two are divided by the call to crucible_execute_func, 
which indicates symbolic execution of the target C or x86 function. Crucible 
is the intermediate language for symbolic execution used by SAW. Internally, 
the semantics of LLVM, x86, and other SAW input languages are defined by 
translation to Crucible. 

One reason for the complexity of these specifications is that SAW differentiates
between data that is allocated and initialized and data that is only allocated.
Other verification tools tend to treat all allocated data as initialized (for example, 
this is true of CBMC [8]). This is generally a sound approximation because C 
compilers tend to behave predictably, but our approach is more accurate to the 
specification of C. 


Functional Specifications. The other role of SAW-script is to verify the connection
between the implementation and the algorithmic specification. In SAW, specifications
are written in Cryptol, a domain-specific language designed for cryptographic
specifications [11]. In the postcondition of the script, we use references to
Cryptol functions to map the outputs of running the program to the outputs of 
our specification programs, ctr32_encrypt and cipher_update. The final lines 
of the specification assert that the memory cells resulting from the program 
must match the required values, i.e., those that would result from executing the 
Cryptol specification. 

We show ctr32_encrypt in Fig. 2. This function defines the top-level behavior
of the CTR mode of encryption, which repeatedly increments an initialization 
vector, encrypts the incremented value with the secret key, and performs an 
XOR of that encryption with the plaintext. 

The first line of the specification defines the type of the function, parame- 
terized by type variable n. AES_GCM_Ctx is a structure used to maintain state 
for the incremental interface to AES, which allows for data to be encrypted and 
decrypted as it becomes available, rather than all at once. The [n][8] input
and output are sequences of n bytes.

The function body consists of a sequence comprehension. This takes input 
bytes one at a time, and labels them with i, which draws from the sequence 
counting up from ctx.len. The separate function EKij performs the encryption 
step using the initialization vector and the key contained in the context. The 
take and drop functions are used to convert the 64-bit length contained in the 
context to a 32-bit number required by the EKij function.

ctr32_encrypt : {n} (fin n) => AES_GCM_Ctx -> [n][8] -> [n][8]
ctr32_encrypt ctx in = out
  where
    out = [ byte ^ (EKij ctx ((take`{32} (drop`{28} i)) + 1) (drop`{60} i)) |
            byte <- in | i <- [ctx.len ...] ]

Fig. 2. Top-level Cryptol specification for AES update. 
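The index arithmetic in Fig. 2 can be made concrete with a small Python sketch. It assumes that i is a 64-bit byte index, so that the take/drop expression selects bits 35..4 (the 16-byte block number, truncated to 32 bits) and the final drop selects the low 4 bits (the byte position within the block). The function toy_keystream_byte is a hypothetical stand-in for EKij and is not a cipher; only the indexing and the XOR structure are meant to mirror the specification.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def split_index(i):
    """Split a 64-bit byte index into (32-bit block counter, byte within block)."""
    block = (i >> 4) & MASK32        # take`{32} (drop`{28} i): bits 35..4 of i
    counter = (block + 1) & MASK32   # the "+ 1" wraps modulo 2^32
    position = i & 0xF               # drop`{60} i: the low 4 bits of i
    return counter, position


def ctr32_encrypt(keystream_byte, start_len, data):
    """XOR each input byte with the keystream byte selected by its index."""
    out = bytearray()
    for k, byte in enumerate(data):
        counter, position = split_index((start_len + k) & MASK64)
        out.append(byte ^ keystream_byte(counter, position))
    return bytes(out)


def toy_keystream_byte(counter, position):
    """Hypothetical stand-in for EKij; NOT a cipher, for illustration only."""
    return (counter * 31 + position * 7) & 0xFF
```

Because encryption is a byte-wise XOR with a keystream that depends only on the index, applying ctr32_encrypt twice from the same starting length returns the original input.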


Another example of a functional specification is the following line describing 
the Sigma0 function: 


S0 x = (x >>> 28) ^ (x >>> 34) ^ (x >>> 39)


In the SHA-384 code, this function is implemented by the Perl code given 
in Fig. 3. This does not execute directly, but rather generates assembly code, 
which is what we verify. The instructions ror and xor correspond to the Cryptol
operations >>> and ^ respectively. 

In order to include the implementation here, some constants have been sub- 
stituted, and we have extracted the relevant lines from around 20 other lines 
calculating other parts of SHA. Those lines are mixed in with even more lines 
of non-interfering SHA calculations, presumably in order to keep the processor 
saturated. Symbolic execution allows us to reason just about these lines of code, 
because interleaved instructions that don’t change the result of the computa- 
tion in a relevant way will not be included when reasoning about the results of 
individual computations. 


'&ror ($a1,39-34)',
'&xor ($a1,$a)',
'&ror ($a1,34-28)',
'&xor ($a1,$a)',
'&ror ($a1,28)', # Sigma0(a)


Fig. 3. Perl implementation of internal SHA computation. 


Notice also that the shift amounts are different between the functional speci- 
fication and the code. In Cryptol, the shift amounts are 28, 34, and 39, but in the 
implementation, we see shifts by 39 - 34, 34 - 28, and 28. This is a performance
optimisation, but it makes the proof effort more difficult. To close this gap, we 
use a system of verified rewrites (see Sect. 6). 
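To see why the two sets of rotation amounts agree, the following Python sketch models the five instructions of Fig. 3 on 64-bit words and compares the result with the Sigma0 specification. It assumes that $a1 holds a copy of $a when the sequence starts, which is not visible in the extracted lines. Random testing is used here only as an illustration; it is not the proof technique used by SAW.

```python
import random

MASK64 = (1 << 64) - 1


def ror(x, n):
    """Rotate a 64-bit word right by n bits."""
    n %= 64
    return ((x >> n) | (x << (64 - n))) & MASK64


def sigma0_spec(x):
    """The specification: three rotations of the input XORed together."""
    return ror(x, 28) ^ ror(x, 34) ^ ror(x, 39)


def sigma0_impl(a):
    """The instruction sequence of Fig. 3, one statement per instruction."""
    a1 = a                    # assumption: $a1 starts as a copy of $a
    a1 = ror(a1, 39 - 34)
    a1 ^= a
    a1 = ror(a1, 34 - 28)
    a1 ^= a
    a1 = ror(a1, 28)
    return a1


def agree_on_random_inputs(samples=1000, seed=0):
    rng = random.Random(seed)
    return all(sigma0_impl(x) == sigma0_spec(x)
               for x in (rng.getrandbits(64) for _ in range(samples)))
```

The implementation rotates an accumulated value, so each later rotation also applies to the terms XORed in earlier: the first term is rotated by 5 + 6 + 28 = 39, the second by 6 + 28 = 34, and the last by 28.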


Verification Process. Once a specification has been defined, it must be veri- 
fied. SAW divides verification into two phases: symbolic execution, and verifica- 
tion of equivalence. Symbolic execution converts an imperative operation into a 
functional term suitable for automated reasoning. Even without specifying the 
expected high-level behaviour of the AES function, the memory layout defined
in the pre-condition is enough for symbolic execution to complete, which has
the effect of proving the imperative code memory safe. We typically verify mem- 
ory safety in this way before developing a specification. This lets us separate 
concerns between functional and safety properties. 

The final task once a symbolic term has been generated is to compare it to
a specification term. SAW uses SMT solving to discharge these proofs, and in 
most use cases, these can be completed automatically. However, the complexity of 
the optimization stages in AES-256-GCM and SHA-384 makes the gap between 
specification and implementation too large to be completely automated. SAW 
solves this with a small tactic language embedded into SAW-script that supports 
term rewriting. We describe these rewrites in Sect. 6.


Modular Reasoning. Symbolic execution is a precise technique with hard limits 
on its scalability. The AES-256-GCM and SHA-384 functions are too large to be 
symbolically executed in their entirety. SAW solves this problem by using
a modular reasoning system called overrides. SAW treats specifications as exe- 
cutable code that can be freely substituted for implementation functions. When 
a function is verified equivalent to a Cryptol specification, calls to that function 
can be overridden (i.e., replaced) during symbolic execution. As Cryptol speci- 
fications are typically much less complex than implementations, this massively 
increases the tractability of the verification task. 

As a result, a typical SAW proof consists of a hierarchy of equivalence proofs. 
The proof begins at the leaf functions, which are verified by symbolic execution. 
The functions at the next level are then symbolically executed with the leaf func- 
tions replaced by their specifications. These are then also added to the library 
of verified functions. This proceeds until the top-level function is verified. One 
of the main tasks when developing a SAW proof is defining these internal spec- 
ifications (our proof is unusual in that we also needed a significant number of 
rewrite rules). 

We also use the override mechanism at the interface between C and x86 
code. Functions in x86 are proved equivalent to Cryptol specifications, and these 
specifications can then be used as overrides in the surrounding C context. This 
approach works because we have defined a compatible memory model that works 
for both C and x86 code—see Sect. 5 for more. 

Finally, the override functionality can be used to supply specifications for
functionality that is assumed rather than verified. This is useful for library calls
that might be out of scope for a particular project, but that might be verified 
in the future. For example, Chudnov et al.’s SAW proofs for s2n [7] use this 
approach to parameterize the proofs of HMAC and DRBG over different primitives.⁴
In our proofs, the only assumptions we make are that OPENSSL_malloc
and OPENSSL_free behave correctly. 


4 SAW’s Verification Pipeline 


SAW is structured as a pipeline of linked verification stages. The inputs to 
the pipeline are, firstly, executable mathematical specifications for the top-level 
function, and selected sub-functions; secondly, the compiled code, made up of 
LLVM and embedded x86 binary code; and thirdly, a proof script which sets 
up memory, identifies the mapping between Cryptol specifications and function 
interfaces, and contains the rewrites that are applied to the resulting logical 
terms. The verification pipeline then works as follows: 


1. The x86 binary is extracted from the LLVM and decompiled into a CFG 
representation that recovers the x86 instructions and control-flow structure. 
This relies on a SAW sibling project called Macaw [12]. 

2. The x86 control-flow graph and LLVM code are divided into functions at the 
interfaces identified in the SAW-script file. 

3. Beginning at the leaves of the call-graph, each x86 and LLVM function is
symbolically executed, resulting in a term written in an intermediate language
called SAW-core. At this stage, calls to any already-verified functions are
replaced by their Cryptol overrides.

4. If a function has an associated Cryptol specification, it too is symbolically 
executed, resulting in a specification term in SAW-core. 

5. The function term and specification term are rewritten using the rewrites 
defined in SAW-script. 

6. The rewritten function and specification terms are proved equivalent through 
a generic solver interface library called What4 [15]. 


Verification proceeds with functions progressively higher on the call-graph, until 
the top-level equivalence is proved between code and specification. 
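The bottom-up order of this pipeline can be sketched in Python as a post-order walk of the call graph. This is a minimal illustration, not SAW's implementation: the call graph is assumed to be acyclic, and verify is a hypothetical callback standing in for steps 3 to 6 (symbolic execution with overrides, rewriting, and the solver call).

```python
def verification_order(call_graph, root):
    """Return functions in post-order, so callees come before their callers."""
    order, seen = [], set()

    def visit(fn):
        if fn in seen:
            return
        seen.add(fn)
        for callee in call_graph.get(fn, []):
            visit(callee)
        order.append(fn)

    visit(root)
    return order


def verify_all(call_graph, root, verify):
    """Verify leaves first; pass already-verified callees as overrides."""
    verified = []
    for fn in verification_order(call_graph, root):
        overrides = [c for c in call_graph.get(fn, []) if c in verified]
        if not verify(fn, overrides):
            raise RuntimeError("verification failed for " + fn)
        verified.append(fn)
    return verified
```

For a hypothetical graph in which top calls mid and leaf_a, and mid calls leaf_a and leaf_b, the order is leaf_a, leaf_b, mid, top, and each caller is checked with its callees replaced by their specifications.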

While the structure of this pipeline is simple, making it work for real code 
requires a significant amount of tool sophistication. SAW is the product of many 
years of refinement and development, and we used many of the components in 
this pipeline without modification. 

Our C support is based on SAW’s LLVM support, which is mature, and 
has been used in many other industry and government verification projects—for 
example, Chudnov et al. [7]. While we do not claim complete coverage of the 
standard, in practice we rarely need to add new C language features to SAW. 
Likewise, Cryptol support is built into SAW and is designed to be symbolically 
executed, so this part of the tool required no modifications. The Macaw and 
What4 tools similarly functioned without modification. 


4 In fact, we have now verified some of the primitives that were only assumed in
this previous work, meaning it should be possible to stitch these proofs together
end-to-end.

Therefore, in this paper we focus on the new capabilities of the tool: our
symbolic execution of x86 instructions, and verified rewrites. For a more detailed 
treatment of the SAW suite as a whole, readers should look at the SAW docu- 
mentation and tutorial [13,14]. 


5 New Capability: x86 Semantics 


The first SAW capability we developed for this project was symbolic execution 
for x86 assembly code, including support for mixed C/x86 code. Doing this 
required us to solve two problems. First, decompiling the binary into a series 
of x86 instructions, and second, defining the semantics of instructions, which 
mainly involves defining the model of memory. 

To decompile we use Macaw, a SAW sibling project which is able to parse 
ELF binaries and output a control-flow graph complete with the representation
of the x86 instructions [12]. We treat Macaw as a black box, and in fact any 
decompiler with similar capabilities could serve in its place. 

Once the CFG has been constructed, we apply our x86 semantics. For the 
behaviour of individual instructions, we consulted the Intel manual. We note 
that processor manuals contain errors, and hand-encoding the semantics could 
also introduce errors. However, we have reasonable confidence in this encod- 
ing because, in practice, most conceivable errors would immediately cause the 
proof to fail. This is because cryptographic functions are very sensitive to small 
changes: most small value errors would result in a dramatically different output. 

Much more important and subtle is the memory model, which describes under 
what conditions reads and writes to memory can occur, as well as describing 
how reads and writes can be combined to store and retrieve values. Unlike C, 
there are almost no accepted memory usage rules for assembly programming, 
aside from the conventions used in a particular program and the Application 
Binary Interface (ABI) for functions that can be called externally. Fortunately, 
AES and SHA implementations are designed to be called by C programs. They 
therefore must follow C-like conventions and respect the ABI. Memory is used 
to get inputs and define outputs, read global constants, and maintain a stack for 
storing temporary results. Functions always respect the boundaries of data as 
provided. Because of this, we were able to adapt SAW’s well-tested model used 
for LLVM support. 

In SAW’s memory model, addresses are represented by a pair of integers: the 
first integer is a base address, identifying an allocated memory region, while the 
second is an offset into the region. Memory operations, such as pointer arithmetic 
and pointer comparisons, are only well-defined for addresses in the same region. 
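A region-based model of this kind can be sketched in a few lines of Python. The class and method names below are hypothetical and chosen for illustration; they are not SAW's API. The sketch shows the two properties described above: pointer arithmetic stays within a region, and comparisons or differences across regions are rejected. It also tracks which bytes have been initialized, reflecting the distinction between allocated and initialized memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    region: int   # identifies one allocation
    offset: int   # byte offset into that allocation

    def add(self, n):
        return Pointer(self.region, self.offset + n)

    def diff(self, other):
        if self.region != other.region:
            raise ValueError("pointer difference across regions is undefined")
        return self.offset - other.offset

    def less_than(self, other):
        if self.region != other.region:
            raise ValueError("pointer comparison across regions is undefined")
        return self.offset < other.offset


class Memory:
    def __init__(self):
        self._regions = {}   # region id -> byte values, None = uninitialized

    def alloc(self, size):
        region = len(self._regions) + 1
        self._regions[region] = [None] * size
        return Pointer(region, 0)

    def _check(self, ptr, n):
        cells = self._regions.get(ptr.region)
        if cells is None or ptr.offset < 0 or ptr.offset + n > len(cells):
            raise ValueError("access outside an allocated region")
        return cells

    def store(self, ptr, data):
        cells = self._check(ptr, len(data))
        cells[ptr.offset:ptr.offset + len(data)] = list(data)

    def load(self, ptr, n):
        cells = self._check(ptr, n)
        chunk = cells[ptr.offset:ptr.offset + n]
        if any(b is None for b in chunk):
            raise ValueError("read of allocated but uninitialized memory")
        return bytes(chunk)
```

Because every access is checked against the bounds of a single region, a write through one pointer cannot affect another allocation, which is the non-overlap property that the specifications rely on.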

Even after defining this model, we had to decide how to apply it within the 
proof. There were two options: (i) modeling the entire memory as a single region, 
and (ii) representing different objects as separate regions. The former is the more 
flexible because it does not enforce any invariants on the way that memory is 
used. Any read or write within the entire memory region is valid at any time. 
This comes at an increased cost of manually specifying necessary invariants. For
example, each function would have to manually encode the memory region it
might write to so that its calling function can predict all of the side effects of 
calling it. 

Instead, we take the second approach: automatically specifying such mem- 
ory invariants as part of the way that memory can be used. This means that 
some valid assembly will be impossible to verify. It could be completely safe 
and correct, but because it violates the strict memory model we’ve chosen, our 
tool will be unable to reason about it. On the other hand, the memory model 
we chose works for all of the cryptographic assembly code we’ve run into, and 
implementing the memory model in this way saves us a substantial amount of 
specification and proof work. 

It is not surprising that this approach works: the models and abstractions C
uses for memory are useful in assembly as well. Furthermore, the ABI and the C 
memory model have heavily co-evolved, making the C memory model a natural 
fit for assembly functions that match the ABI. 

The memory model is applied by symbolic execution of the CFG that results 
from Macaw. This symbolic execution has two main functions: to efficiently update
a symbolic representation of memory, and to discharge side conditions that must
hold in order for symbolic execution to continue. The result is a SAW-core term 
representing the input-output behaviour of the x86 binary code. 


6 New Capability: Verified Rewrites 


The second SAW capability we developed was a simple language of term rewrites 
for use in proofs. 

After symbolic terms have been constructed from C, x86, or Cryptol, we 
must prove equivalences between these terms. The design goal with SAW is that 
these proofs are completed mostly automatically using SMT solvers. While this 
has worked well in previous, less-complicated proofs, the functional terms that 
result from AES-256-GCM and SHA-384 often proved to be intractable for the 
solvers without preprocessing. This is exactly because these algorithms are so 
heavily optimised, as we have discussed above. 

In order to solve this, we introduce a language of equivalences between terms 
that are themselves verified by the solver. By applying these rewrites, we can 
close the gap between the more abstract Cryptol term and the optimized C/x86 
term. These rewrites serve as a small tactic language for controlling the proof, 
while preserving the principle that SAW proofs are mostly automatic. 

To illustrate how this works, we consider an example rewrite from our SHA- 
384 proof. In the Cryptol portion of our proof, we define the following function, 
S0 (shortened for convenience from Sigma0):


S0 x = (x >>> 28) ^ (x >>> 34) ^ (x >>> 39)


In Cryptol, >>> and <<< are right and left rotation respectively, while ^ is XOR.
In order to complete the proof, we need to be able to rewrite occurrences of this
function. To do this, we define the following rewrite, Sigma0_thm:

Sigma0_thm <- prove_folding_theorem
  {{ \x -> (x ^ ((x ^ (x <<< 59)) <<< 58)) <<< 36 == S0 x }};


The left hand side of this equation is how symbolic execution interprets 
the code in Fig. 3. The rotate-rights have been swapped to rotate-lefts, which
allows our semantics to model both types of instruction by rotate-left. In order 
to swap the rotates, we subtract the rotate amount from 64, which is why we 
have a different set of constants than we see in either the specification or the 
implementation. 

The solver verifies this equivalence for all possible values of x and saves it 
with the name Sigma0_thm. In this case, we verify the equality using the ABC 
solver [6] through What4, but different solvers can be applied as needed to
prove different equivalences.
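The equivalence can also be checked by hand with a short argument, sketched below in Python. This is a different technique from the solver-based proof described above: both sides are built only from rotations and XOR, so each is a linear map over GF(2), and two linear maps that agree on the 64 one-hot basis words agree on every 64-bit input. The left rotations by 59, 58, and 36 correspond to right rotations by 5, 6, and 28.

```python
MASK64 = (1 << 64) - 1


def rol(x, n):
    """Rotate a 64-bit word left by n bits."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64


def ror(x, n):
    """Rotate right, expressed as a left rotation by 64 - n."""
    return rol(x, 64 - (n % 64))


def folded_lhs(x):
    """Left-hand side of the rewrite: the symbolically executed code."""
    return rol(x ^ rol(x ^ rol(x, 59), 58), 36)


def sigma0(x):
    """Right-hand side of the rewrite: the Sigma0 specification."""
    return ror(x, 28) ^ ror(x, 34) ^ ror(x, 39)


def equal_on_basis(f, g):
    """For GF(2)-linear f and g, agreement on the basis implies f == g."""
    return f(0) == g(0) and all(f(1 << k) == g(1 << k) for k in range(64))
```

The basis check only applies because no additions or other non-linear operations appear in either term; rewrites that involve arithmetic need a general solver.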

Consider the following SAW-script command, which verifies that an x86 function
matches its specification:


sha512_block_data_order_spec <-
  crucible_llvm_verify_x86 m "<filename>" "sha512_block_data_order"
    [ ("K512", 5120) ] // Initialize global for round constants
    true
    sha512_block_data_order_spec
    (do {
      simplify (cryptol_ss ()); // std simplifications
      simplify (addsimps thms empty_ss); // folding theorems
      simplify (addsimp concat_assoc_thm empty_ss); // final theorem
      w4_unint_yices ["S0", "S1", "s0", "s1", "Ch"]; // uninterpreted fns
    });


Here, the do-block defines the order in which the simplification rewrite rules 
are applied. The list of folding theorems, thms, contains 30 rewrite rules, including
the Sigma0_thm presented above. The concat_assoc_thm theorem normalizes the
concatenations that result from other proof rules. The final line of this script
instructs the Yices solver to treat certain functions as uninterpreted, including
the S0 function. This illustrates the usefulness of the rewriting support. Rather
than reasoning about the S0 function directly, we rely on the verified rewrites.
This allows us to abstract away from complexity that previously made the proof 
infeasible for the solver. 
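The effect of a folding rewrite can be illustrated with a toy term rewriter in Python. This is a sketch under simplifying assumptions, not SAW's rewriting engine: terms are nested tuples, pattern variables are strings beginning with "?", and the rules are applied in a single bottom-up pass. The rule below folds the expanded rotate-and-XOR term back into an application of the Sigma0 function, which a solver could then treat as uninterpreted.

```python
def match(pattern, term, env):
    """Return an extended binding environment, or None if there is no match."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in env:
            return env if env[pattern] == term else None
        return {**env, pattern: term}
    if (isinstance(pattern, tuple) and isinstance(term, tuple)
            and len(pattern) == len(term)):
        for p, t in zip(pattern, term):
            env = match(p, t, env)
            if env is None:
                return None
        return env
    return env if pattern == term else None


def substitute(template, env):
    """Instantiate the pattern variables of a template."""
    if isinstance(template, str) and template.startswith("?"):
        return env[template]
    if isinstance(template, tuple):
        return tuple(substitute(t, env) for t in template)
    return template


def rewrite(term, rules):
    """Rewrite subterms first, then try each rule once at the current node."""
    if isinstance(term, tuple):
        term = tuple(rewrite(t, rules) for t in term)
    for lhs, rhs in rules:
        env = match(lhs, term, {})
        if env is not None:
            return substitute(rhs, env)
    return term


# Folding rule: the expanded term on the left becomes Sigma0 applied to ?x.
SIGMA0_RULE = (
    ("rol", ("xor", "?x", ("rol", ("xor", "?x", ("rol", "?x", 59)), 58)), 36),
    ("Sigma0", "?x"),
)
```

The pattern variable ?x occurs three times, so the rule only fires when all three occurrences bind to the same subterm. A rule set applied repeatedly may fail to terminate, which is one reason the choice of rewrites is left to the proof engineer.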

Overall, the tactics we use for SAW proofs constitute a very simple decision 
procedure, made up almost exclusively of user-supplied rewrites. The other main 
mechanism we have for guiding proofs is the modular override system described 
in Sect. 3, which allows us to decompose proof tasks into lemmas, at least at the 
granularity of functions. In practice, we have found that these capabilities are 
sufficient to meet the needs of proving cryptographic implementations correct 
with respect to specifications. 

Ultimately, we may find ourselves limited by the tools available in SAW for 
controlling the proof process, particularly if we attempt to prove higher degrees 
of abstraction between specification and implementation. These more manual 
proofs largely fall outside of the scope of what SAW aims to do well. An ideal
solution would be to export proof goals to Coq, Lean, or F*, all of which already
have better, highly usable tools for manual proof. Chudnov et al. have previously
demonstrated that SAW proofs about code and more abstract Coq proofs can 
be connected in this way. 


6.1 Role of Rewrites in AES-256-GCM and SHA-384 Proofs 


Rewrites in SAW can be seen as a small tactic language, serving a similar purpose 
to proof tactics in Coq or Isabelle. However, SAW occupies a very different point 
in design space, because it is designed to maximize proof automation. Heavy use 
of SMT-backed automation is the reason our proofs were feasible, but if the 
automation makes poor choices, it can also obstruct the proofs. We use rewrites,
along with appropriate choices of abstraction boundaries, to recover abstractions
that automation would not discover itself. 

For example, consider the Sigma0_thm rewrite defined above. The solver can 
verify the rewrite when supplied in isolation. However, in the context of SHA,
the solver fails to identify this as a valuable fact. One reason is that S0 is a
function that is present in the Cryptol specification, but this abstraction is lost
when we symbolically execute an x86 function. The rewrites replace occurrences
of S0 with an uninterpreted function, pruning the proof space dramatically.

However, there is a trade-off in reintroducing such an abstraction. Even if
the abstraction holds locally, the functionality that calls that abstraction might
depend on the internal functionality. In that case, swapping out the code for
an uninterpreted function could actually turn a solvable goal into an unsolvable
one. The answer is to choose these rewrites carefully: this is one of the main
intellectual challenges in completing a proof. In general, this problem is
undecidable. For example, a set of inferred rewrite rules may not terminate. This
means that, at best, rewrite inference might be a guided special-purpose mode of
solvers, rather than a general-purpose approach.

Our rewrites plug into SAW late in the pipeline, after many of SAW’s opti- 
mizations. This means that rewrites sometimes have to compensate for earlier 
optimizations. SAW is designed to aggressively optimize terms into a form suit- 
able for the solver, and in some cases, this means breaking up abstractions that 
would be useful in completing the proof. In these cases, our rewrites must operate 
on the post-optimization proof term. 

For example, in one case, it would have been desirable for our proof to use 
the term: 


{{ \x -> (slice_59_5_0 x) # (slice_0_59_5 x) == x <<< 59 }}; 


However, SAW discovered that it could drop off the operation on the final 
byte of x, but to do so, it had to break up x into its constituent bytes. This 
is a desirable optimization if the term is passed to the solver directly, because 
the solver itself will reason at the level of bytes. However, this made writing an 
appropriate rewrite for our proof much more challenging. We include the eventual 
rewrite rule in Fig. 4. Again, a large amount of the intellectual challenge with our
proof rested in finding appropriate rewrites that integrated with SAW’s existing
automation. 
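The slice identity above, and the simplification that complicates it, can be checked on concrete values with the following Python sketch. It assumes that a name such as slice_59_5_0 means "skip the 59 most significant bits, keep 5 bits, and leave 0 bits below", applied to a 64-bit word; this reading is an assumption based on the naming. Random testing is used for illustration only.

```python
import random

MASK64 = (1 << 64) - 1


def rol(x, n):
    """Rotate a 64-bit word left by n bits."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64


def slice_bits(x, before, width, after):
    """Keep `width` bits, skipping `before` high bits and `after` low bits."""
    assert before + width + after == 64
    return (x >> after) & ((1 << width) - 1)


def concat_low5_top59(x):
    """(slice_59_5_0 x) # (slice_0_59_5 x): low 5 bits placed above the top 59."""
    low5 = slice_bits(x, 59, 5, 0)
    top59 = slice_bits(x, 0, 59, 5)
    return (low5 << 59) | top59


def check(samples=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(samples):
        s, x1 = rng.getrandbits(64), rng.getrandbits(64)
        total = (s + 64 * x1) & MASK64
        # The concatenation of the two slices is a left rotation by 59.
        if concat_low5_top59(total) != rol(total, 59):
            return False
        # A multiple of 64 cannot change the low 5 bits, so it can be dropped
        # from the sum inside the low slice only.
        if slice_bits(total, 59, 5, 0) != slice_bits(s, 59, 5, 0):
            return False
    return True
```

The second check is why the two slices in the optimized term no longer share a syntactically identical argument: a term that is a multiple of 64 can be removed from the sum under the low slice but not from the sum under the high slice.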


rotate59_slice_add_thm <- prove_folding_theorem
  {{ \x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18
      x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34
      x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 ->
      (slice_59_5_0 (x0 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11
                     + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20
                     + x21 + x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29
                     + x30 + x31 + x32 + x33 + x34 + x35 + x36 + x37 + x38
                     + x39 + x40 + x41 + x42 + x43 + x44 + x45 + x46 + x47
                     + x48 + x49 + x50))
      # (slice_0_59_5
          (x0 + (64 * x1) + x2 + x3 + x4 + x5 + x6 + x7 + x8
           + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17
           + x18 + x19 + x20 + x21 + x22 + x23 + x24 + x25 + x26
           + x27 + x28 + x29 + x30 + x31 + x32 + x33 + x34 + x35
           + x36 + x37 + x38 + x39 + x40 + x41 + x42 + x43 + x44
           + x45 + x46 + x47 + x48 + x49 + x50))
      == (x0 + (64 * x1) + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
          + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20
          + x21 + x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31
          + x32 + x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 + x42
          + x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50) <<< 59 }};

Fig. 4. Example rewrite rule. This rule is made much more complex by the fact it 
happens after SAW’s existing term optimization phases. 


7 Results and Lessons Learned 


Our proofs run on the current version of AWS-LC [2] as of January 2021, built 
using the default compiler flags. We verify the AVX implementation of SHA-384, 
which is the current fastest implementation. Our AES-256-GCM proof uses the 
code path for AESNI, CLMUL, and AVX instructions. 


Proof Size and Composition. Our code can be broken down into top-level func- 
tional specifications, top-level interface specifications, and proof scripts. The 
top-level specifications are what must be understood in order to understand the 
results of our proofs. 

We have 168 lines of top-level interface specifications, which define the 8 
interface functions that we’ve proved correct. These specifications describe memory
layouts for the interface functions and link them to the top-level functional spec- 
ification. We have 435 lines of top-level functional specifications, which were only 
slightly modified from specifications that we and others have used in previous 
cryptographic verification projects. These are almost completely free of implementation
details, and live in a specifications-only repository, separate from the
code. If the functional specifications were made any shorter, they would likely
also be less readable, so we believe they are close to optimal for their purpose. 

The proof scripts consist of 1286 lines of intermediate function specifications, 
rewrite rules, tactics, and proof running logic. These intermediate functions are 
both proved and checked each time they’re called. As a result, they do not need 
to be understood or trusted in order to believe the top-level results.

Following continuous reasoning practice [19], our proofs are integrated into 
the CI process for AWS-LC. We do not expect this code to change, but we have 
adopted this practice as part of a larger AWS-LC assurance effort, including code 
which does change more often. Quickcheck versions of our proofs run as GitHub
Actions that take around 25 min.⁵ The complete version runs on private systems
in 30 min, but using more cores and memory. A significant part of our proof and
tool development effort was dedicated to making sure proofs could run within
a time budget acceptable for CI (typically 1 h). For example, this sometimes
required introducing overrides to break the proof into smaller segments. 


Achieving Trust in the Proof. SAW is designed to increase confidence in software, 
but it cannot supply total certainty. A key question is therefore what parts of 
the toolchain and proof must be trusted. For our proofs, the Trusted Code Base 
(TCB) consists of: 


— The top-level functional and interface specifications in the proof scripts. 

— Library behavior that is assumed and then used in overrides. In our case, that 
OPENSSL_malloc and OPENSSL_free behave correctly. 

— The SAW and Cryptol toolchain, the tools themselves, the language models 
of x86 and LLVM, the back-end SMT solvers, and ultimately the Haskell 
runtime and other downstream infrastructure. 

— Correctness of the compilation chain from LLVM to executable (for C code), 
and correct execution of compiled code by the hardware. 

— Any behaviours of code not covered by proofs at the fixed sizes we have 
verified. 


Although this TCB is significant, it is comparable in scale to similar verification
projects like EverCrypt [21]. The highest-impact improvement would likely
be proving the algorithms at arbitrary sizes. While we could throw computation
at running the proofs at a wide range of fixed input sizes, this would spend 
computation and developer wait time with fairly little benefit. We believe an 
inductive approach is achievable in future work and would allow us to verify the 
algorithms once and for all. 

In the meantime, we have covered some of the most-used block sizes, as well
as all code paths. Given we have verified the algorithm at fixed sizes, and the 
code does not branch on input size, the only place that bugs could remain is the 
looping behaviour at other sizes. We have inspected the dynamically bounded 
loops in the code carefully to mitigate this possibility. 


5 https://github.com/awslabs/aws-lc-verification/actions.

We could also shrink the TCB by applying foundational techniques such as
used in the Verified Software Toolchain project [3]. This would remove much of 
the need to trust the tool itself. However, we believe doing this would make it 
infeasibly expensive to develop a tool as complex as SAW, at least given current 
foundational techniques. 

Another important question is whether a correctness failure in the toolchain 
could result in a proof that does not establish the result we expect. We believe 
the probability of this is quite low. Our current best defense against this failure 
of TCB is thorough testing and code review. SAW itself is a well-tested tool that 
has been used for many projects. The language models have also been tested, 
and it is unlikely that a behavioural bug could cause an incorrect specification 
to be verified. For this to occur, several failures would need to occur at once. 

As an aside, the intent of the people doing the proof, and the precise nature 
of any external review should be considered when answering this question. A tool 
bug is unlikely to result in a proof that falsely appears correct, assuming that 
the proof effort is done in good faith. On the other hand, all tools have bugs, 
and in most logic-based tools, bugs can allow the construction of false proofs 
that appear superficially correct. In other words, for most current tools, trust
in verified code requires trust in the team and process producing the proof. 

We believe the highest risk of accidental error lies in the specifications. It 
is quite common for draft specifications to contain subtle discrepancies between 
what users intend and the specification’s formal meaning. We mitigate this
with extensive manual audit. Every line of code we write is reviewed at least 
once within the verification team, and once by AWS-LC domain experts. The 
internal review ensures that our specifications are correct and that our style is 
consistent with our guidelines. The external review allows us to ensure that we 
have explained our proofs correctly, and that we have correctly specified the 
functions in the context that they are being used. 


Proof Engineering Process. The proofs were completed over six calendar months, 
using approximately nine person-months of engineering effort in total. We consider
this to be an upper bound estimate as the proof effort was mixed in with tool 
improvements, in particular for the less-mature x86 tooling. The core team con- 
sisted of four engineers, with additional contributions from verification tooling 
experts and AWS-LC domain experts. This was a significant
engineering effort, but for our purposes it was a good use of resources
to achieve a high level of confidence in the AWS-LC code. Proofs were completed 
alongside more traditional assurance approaches, e.g., testing, fuzzing, and code 
audits. 

New proof techniques and tooling were a factor in our success, but there is no 
single technical breakthrough that made these proofs possible. While combined 
x86 and C verification is challenging, it would likely be possible (although not 
easy) to add such a capability to a number of existing tools. Rather, a series of 
tool extensions, design choices, and engineering working practices combined to 
make the project feasible.

Using SAW, we automated most of the trivial reasoning, which meant that
a majority of proof engineering was spent on legitimately difficult verification 
problems. These mainly involved understanding the code being verified and using 
rewrites to manually rearrange verification terms to make them amenable to 
automated proving. Many of these steps could in principle be automated, but 
in practice engineers sometimes needed to resort to clunky debugging measures. 
We find it unsurprising that highly specialized code such as AWS-LC would 
generate edge cases that challenge generic proof automation. For proofs of this
type, we believe completely automated proving is currently out of reach.

We take several steps to try to minimise engineering effort when building proofs.
The most important of these is to lean on automation wherever possible. One 
example is that we try to avoid internal specifications, which are often the most 
challenging part of the proof. Because SAW is a bounded verifier, internal specifi- 
cations are just a performance optimization—given sufficient compute resources, 
we could in principle symbolically execute the entire code-base. Of course, in 
practice, internal specifications are needed to make the proof tractable. Our 
practice is to prove functions at the largest scope which fits within our time 
budget. By doing this, we are sometimes able to avoid specifying internal func- 
tions that do relatively little computationally. 

Another important strategy for us is to separate memory-safety proofs from 
functional correctness proofs. We have found that much of the technical risk 
in a verification project can be eliminated at the memory-safety stage. This is 
where the verification tools are most likely to run into show-stopping bugs that 
will put success of the project in jeopardy. Separating these concerns results 
in proof terms that are smaller and easier to understand, so bugs are easier to 
diagnose. Then, if we run into challenges during the correctness proving phase, 
we can limit the cause to correctness properties, eliminating a large fraction of 
the proof from consideration. 

An important factor that enabled us to carry out these proofs is a team of 
expert proof engineers who have built their skills over years. This project was 
undertaken by a team which has worked continuously on verification projects for 
four years. This expertise has given us a better understanding of what we can 
attempt, and a far wider toolkit to dip into when things go wrong. We have seen, 
anecdotally, similar evidence of improved verification capabilities from other 
long-standing teams, for example the Project Everest, seL4, and CompCert
projects. Long-standing teams of proof experts are still unusual, but we believe 
they will be necessary to achieve the most ambitious proof engineering tasks, 
just as they are in software and tool development. 

A significant lesson that we have learned about proof engineering is that a
tool’s behaviour when it fails is more important than its behaviour when it succeeds. This is a critical
aspect of verification tools often overlooked in research papers. Many tools show a 
demo where everything works, but in a proof engineering effort, the vast majority 
of time is spent with a proof that does not work. In that sense, one of the most 
critical aspects of a verification tool is what it does when the proofs are not 
working. SAW provides some support for diagnosing errors, but there is a lot of 
room for improvement. It lacks tooling to allow proof engineers to easily inspect
and modify proof terms that are not successfully proving. Furthermore, it has
inefficiencies that can make repeatedly running and modifying proofs slow and
painful, reducing the time that engineers can spend on the real challenges of
verification.


7.1 Trade-Offs When Building on Existing Verification Tools 


As we saw in Sect. 6, rewriting is an example where SAW’s existing tooling made 
some parts of our proof more awkward. It is reasonable to wonder whether we 
could have modified SAW to allow more control of the rewriting pipeline. This 
highlights an interesting trade-off that exists when developing proofs using a 
more mature tool like SAW. 

SAW has existed for a decade, and has been developed and improved itera- 
tively over this time. Design decisions such as the order in which optimizations 
occur can sometimes be baked deeply into the tool. This stands in contrast to 
more experimental tools which often have short histories and a relatively clean 
design that can be torn down and refactored easily. SAW also has an active user 
community which relies on it for different verification and assurance tasks. The 
main users are at Galois, Amazon Web Services, and in the US government. 
This means that tool changes need wider approval from a community. Again, 
this stands in contrast to research tools which often have a single designer who 
is also the main user. The effect of this is that changes such as the introduction 
of rewriting must be carefully designed to fit with SAW’s existing architecture. 

The pay-off for these restrictions is an enormous increase in the power and 
scope of what we can achieve with the tool. In the large, we have benefited 
from many features that were developed by independent research teams. For 
example, we rely on the Macaw decompiler, which we used off-the-shelf without 
modification. The SAW LLVM semantics is likewise a product of many years 
of research, which did not require any further work from us. In the small, SAW 
embodies many, many clever tricks and pieces of good design that together make 
verification of challenging problems more feasible. Sometimes working with a 
mature tool imposes costs, but overall we believe it raises the bar for our work 
in a way that easily justifies the cost. 

An open question for us is how we can make such collaboration possible 
across the verification community. Boogie [4] is a good example of a verifica- 
tion technology that has seen use across different teams and institutions. Proof 
assistants and SMT solvers are also widely used as a basis for new tools. How- 
ever, there are still very few software verification tools that have seen significant 
adoption. We believe such tools will be necessary in the future if we collectively 
are to tackle larger and more complex verification problems. 


7.2 Verified Code Generation Versus Verifying Existing Code 


While the approach in this paper results in an artifact that may appear externally 
similar to other state-of-the-art verified cryptography efforts, there are some
engineering factors that might influence which approach is most appropriate for
a particular cryptographic use-case. The approach of EverCrypt, Jasmin, Fiat,
and similar efforts requires the user to produce code or a model in a language
that is specific to the verification system. While these systems have demonstrated
the ability to produce efficient verified implementations, they cannot directly verify
existing code. Our approach verifies existing code without modification, and 
there are several engineering benefits to this. 

The most significant reason to verify existing code is that producing new code 
or modifying existing code introduces risk. Modifying optimized cryptographic 
code is particularly risky because it is complex, and because an error could have 
a devastating impact on the security of the system. A software project may be 
unwilling to accept the risk of modifying mature code, even if the new code 
is formally verified. For example, OpenSSL and its variants have been tested 
and audited over more than a decade, and this maturity is appealing to many 
software projects. In our approach, the code is verified without any modification. 
Zero new risk is introduced, and the verification process only increases trust in 
the system. A related benefit is that the verified code maintains any existing 
certifications, such as FIPS 140-2. 

Another benefit of verifying existing code is that the verification works on 
the programming languages that are already used in the project, and the build 
pipeline does not require additional compilers or other tooling to support the 
language of the verification system. Having the build depend on this tooling 
can be risky because it is less familiar and less mature compared to the com- 
pilers and build systems that are typically utilized. There is a risk that a build 
pipeline could break or produce incorrect machine code due to a bug or lack of 
understanding of the verification system. In contrast, our approach produces a 
verification pipeline that is parallel to the build pipeline, and a failure in this 
pipeline does not have any impact on the main build pipeline. 

Many cryptographic applications do not have any legacy concerns and never 
plan on maintaining or improving cryptographic code by hand. In those cases, 
EverCrypt, Jasmin, and Fiat all produce trustworthy, high-performance imple- 
mentations that might prove easier to use and understand than what is provided 
by OpenSSL and its variants. Long-term support might be a concern, given these 
are research tools. However, the slow-moving nature of cryptographic code makes
it less likely that the implementations would need modification in the future. 


8 Conclusion and Future Work 


The purpose of formal verification is to allow users to be confident in the soft- 
ware on which they depend. This is the reason that AWS-LC, BoringSSL, and 
OpenSSL are excellent targets for formal verification. Nearly everyone who uses 
the internet relies on this code for security, either through end-user software, 
or through a cloud provider’s infrastructure. Our proofs show for the first time 
that this kind of highly optimised, hand-written code matches its mathematical 
specification. More importantly, we show that such code can be verified for a 
reasonable amount of proof engineering effort. 

We do not consider our proofs the last word on this code: there are several
ways in which our work can be improved. Most importantly, we have not yet 
verified the OpenSSL version of AES-256-GCM and SHA-384. Based on inspec- 
tion of the code, we believe the proofs would only need small changes to the 
term rewrites, but this is currently not a high priority in comparison to further 
AWS-LC assurance work. 

There are also several ways we could improve the proofs themselves. We 
have verified this code at fixed input sizes. We believe we have covered all edge 
cases, so the probability that bugs remain is low, but a size-agnostic proof would 
be more complete. Our proofs also rely on term rewriting tactics to close the 
gap between implementation and specification. These rewrites are specialized to 
our application and are therefore the most fragile part of the proof. We believe 
that, with further research, automated solvers could solve many of these logical 
queries without the need for manual tactics (this would also make our proofs 
less fragile against code change). Finally, our proofs say nothing about non- 
functional security properties, such as timing or architectural side channels, nor 
do they connect to cryptographic security proofs. 

We are at an exciting moment for cryptographic verification. It is now possi- 
ble to deploy verified cryptography without compromising on performance. We 
are tantalisingly close to a world where most cryptographic traffic originates from 
verified code, and where new cryptographic primitives are verified as a matter 
of course. For our part, we consider AES-256-GCM and SHA-384 a stepping 
stone to the real prize: a fully verified library of production-grade cryptographic 
primitives. Stay tuned! 


Abstract. This paper introduces a new property called robust reachability
which refines the standard notion of reachability in order to take
replicability into account. A bug is robustly reachable if a controlled input 
can make it so the bug is reached whatever the value of uncontrolled 
input. Robust reachability is better suited than standard reachability in 
many realistic situations related to security (e.g., criticality assessment or 
bug prioritization) or software engineering (e.g., replicable test suites and 
flakiness). We propose a formal treatment of the concept, and we revisit 
existing symbolic bug finding methods through this new lens. Remark- 
ably, robust reachability allows differentiating bounded model checking 
from symbolic execution while they have the same deductive power in the 
standard case. Finally, we propose the first symbolic verifier dedicated 
to robust reachability: we use it for criticality assessment of 4 existing 
vulnerabilities, and compare it with standard symbolic execution. 


1 Introduction 


Context. Many problems in software verification are encoded as reachability 
queries of some undesired condition—a bug, the exploitation of a vulnerability, 
etc. When a verification engine establishes that a certain buggy location in the 
program is reachable, an input triggering the bug is reported to the developer so 
that it can be fixed. In the case of techniques based on an under-approximation of 
program behaviors, like Symbolic Execution (SE) [9] or Bounded Model Check- 
ing (BMC) [13], we even have in principle the guarantee that the reported issue 
is real (correctness): there are no false positives. 


Problem. Yet, things are more subtle in practice, as some bugs can be triggered 
reliably whereas others only happen in very specific and highly improbable initial 
conditions. While standard reachability cannot tell the difference, this distinction
is crucial in many real-life scenarios related to security (bug triage, bug
prioritization, criticality assessment) or software engineering (test suite replica- 
bility and the problem of flaky tests [42]). For example, fuzzers are able to detect 
so many bugs [38] that they can lead to “bug triage issues” [30]. If each replicable 
(reliably-triggered) bug is hidden by dozens of more fragile ones in the reports 
of a verification engine, it is hard to focus development effort efficiently. Also, 
if one is only interested in vulnerability reports, bugs which cannot be reliably 
triggered may even be dismissed as “not exploitable” altogether. 


Goal and Challenges. Our goal is to develop a formal framework able to dis- 
tinguish replicable bugs from fragile bugs, and amenable to automatic software 
verification—precisely, we want to be able in practice to find such replicable bugs. 
This is challenging as we need to avoid any quantitative [37] or probabilistic rea- 
soning [2,34], insofar as they would hinder automation on real examples—these 
techniques are often either restricted to finite-state systems [2,34] or rely on 
highly expensive model counting solvers [11,39]. 


Proposal. Our approach consists in partitioning inputs of the program into con- 
trolled inputs and uncontrolled inputs. This lets us refine the concept of reachabil- 
ity into robust reachability: a (buggy) location of a program is robustly reachable 
if there exist controlled inputs, such that for all uncontrolled inputs, this location 
is reached. In other words, with adequate input we do not need luck. 

We typically focus on security scenarios where an attacker provides controlled 
input in one go, without knowledge of uncontrolled input, typically sending a
maliciously crafted file to obtain remote code execution or privilege escalation. We
deliberately exclude interactive attack scenarios and weaker interpretations like 
“bugs replicable most of the time” in order to keep proof methods tractable. 

Proving robust reachability is harder than standard reachability. While we 
show that robust reachability is expressible in formalisms like branching tempo- 
ral logics [14], hyperproperties [16] or hyper temporal logic [15], there exist no 
efficient automated analysis methods for these formalisms at the software level 
(for Turing-complete languages). Therefore, we investigate dedicated verification 
techniques, revisiting standard methods (SE, BMC) for standard reachability as 
well as some of their standard companion optimizations. 

Our prototype of Robust Symbolic Execution (RSE) relies on the ability of 
state-of-the-art Satisfiability Modulo Theories (SMT) solvers [4] to generate models
for universally quantified formulas [25,27,44], which comes with a performance 
and completeness cost—yet we report promising results. 


Contributions. We claim the following contributions. 


- We formally introduce the concept of robust reachability (Sect. 4) and motivate
its use (Sect. 2), giving practical examples where standard reachability
leads to false positives in practice (whatever the underlying verification
technology). We also characterize robust reachability in terms of temporal logic
and hyperproperties, and compare it with non-interference (Sect. 4);

- We revisit Symbolic Execution (SE) [9] and Bounded Model Checking (BMC)
[13] and show how they can be lifted to the robust case (Sect. 5). While
they both have the same deductive power in the standard case, this no longer
holds in the robust setting; yet, path merging allows Robust SE to keep pace
with Robust BMC. Finally, we show how to adapt standard optimizations
for Symbolic Execution and Bounded Model Checking;

- We implement and evaluate¹ (Sect. 6) the first symbolic execution engine dedicated
to robust reachability, namely BINSEC/RSE. We show how to use it
for criticality assessment of 4 existing vulnerabilities (CVEs), and compare it
with standard symbolic execution. RSE appears to be tractable with reasonable
overhead, yielding false-positive-free symbolic reasoning.


We believe robust reachability is an important sweet spot in terms of expressiveness
and tractability, making it possible to highlight serious bugs in practical situations.
We hope this first step will pave the way to more refinements and applications 
of robust reachability. 


2 Motivation 


In this section we show why standard reachability is not always a good fit for 
bug finding, as it cannot distinguish between replicable bugs and fragile bugs. 


    void fill(unsigned n, char* ptr) {
      for (unsigned i = 0; i < n; i++) {
        ptr[i] = 0x61;
      }
    }
    void victim() {
      unsigned n = controlled_input;
      char buffer[8];
      fill(n, buffer);
    }
    void main() {
      victim();
    }

(a) C-like code, for simplicity

     1 void victim() {
     2   /* stack variables, top to bottom */
     3   // return address goes here
     4   int canary = global_random_value;
     5   char buffer[8];
     6   /* end stack variables */
     7
     8   register unsigned n = controlled_input;
     9   fill(n, buffer);
    10   if (canary != global_random_value)
    11     fail_and_dont_return_at_all();
    12   /* everything is ok */
    13 }

(b) Explanation of compiler instrumentation with Stack Smashing Protection (SSP)


Fig. 1. Simple stack buffer overflow 


Stack Canaries. Consider the program presented in Fig. 1. It suffers from a 
stack buffer overflow: if variable n is greater than 8 (the size of buffer), then 
0x61 will be written to stack memory above buffer. For high enough n, this 
will overwrite the return address (Fig. 1b, line 3) of function victim and make 
the program jump to an unexpected program location when victim returns. 


¹ The tool, benchmark and data are available at https://github.com/binsec/cav2021-artifacts
and https://zenodo.org/record/4721753.

Mitigations for such programming errors exist, like Stack Smashing Protection
(SSP) [18]. This technique consists in pushing a randomly-chosen constant
value called a canary at the top of the stack in the prologue of each function, and 
checking that this value is intact before returning. If the canary has been tam- 
pered with, the program exits to prevent exploitation (Fig. 1b, line 11). Here, 
SSP prevents the attacker from overwriting the return address of victim, as 
doing so also overwrites the canary with 0x61616161. This will be detected at 
line 10 of Fig. 1b with probability $1 - 2^{-32}$ on a 32-bit architecture: the only way
to pass through it is to have the canary value equal to 0x61616161. Hence, the 
buffer overflow in this program is not exploitable anymore. 
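
As a worked version of this estimate, the computation below assumes, for illustration only, that the canary is drawn uniformly at random among all 32-bit values; the detection probability is then the complement of the single lucky draw:

```latex
\[
\Pr[\text{overflow detected}]
  \;=\; 1 - \Pr[\text{canary} = \texttt{0x61616161}]
  \;=\; 1 - \frac{1}{2^{32}}
  \;\approx\; 1 - 2.3 \times 10^{-10}.
\]
```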


Table 1. Standard reachability is not a good criterion to measure the protection of 
SSP on the program of Fig. 1. 


Prog. (Fig. 1) | Ground truth | Standard reachability | BINSEC [23]  | Angr [46]    | Robust reachability | BINSEC/RSE
No SSP         | Vulnerable   | Vulnerable ✓          | Vulnerable ✓ | Vulnerable ✓ | Vulnerable ✓        | Vulnerable ✓
SSP            | Protected    | Vulnerable ✗          | Vulnerable ✗ | Vulnerable ✗ | Protected ✓         | Protected ✓


The Problem with Standard Reachability. Can the attacker hijack 
the control flow without triggering SSP? We can model this security ques- 
tion as a standard reachability query over inputs controlled_input and 
global_random_value. The attacker succeeds if line 12 is reachable with the 
additional condition that the return address of victim is overwritten with an 
unexpected address. 

Unfortunately, this standard reachability query is satisfiable with the canary 
global_random_value equal to 0x61616161 and controlled_input equal to 
e.g., 42. And indeed, binary-level SE tools Angr [46] or BINSEC [23] do report 
the bug as reachable (cf. Table 1). Yet, this answer is unsatisfying as this only 
happens with a very low probability: it may not be considered a plausible attack. 
Hence, it turns out that SE can yield false positives in practice, especially in a
security context.


Proposal: Robust Reachability. We label controlled_input as a controlled 
input and global_random_value as an uncontrolled input. There exists no value 
of controlled_input such that victim returns to an address tampered with 
independently of the value of global_random_value. We thus say that our 
exploitation condition (line 12) is not robustly reachable. We can automatically 
verify this intuition. We adapted the SE engine of BINSEC to robust reachability: 
our tool finds the vulnerability when we disable the protection (by labelling the 
canary as controlled input) and does not find it anymore when the protection is 
present. This shows that robust reachability can model the protection provided 
by SSP, while standard reachability cannot. 
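
The contrast between the two queries can be illustrated with a toy brute-force model of Fig. 1b. This sketch is not BINSEC/RSE: it assumes an 8-bit canary instead of a 32-bit one, a one-byte return address, and a hypothetical `ssp` flag that switches the protection off. All names (`hijacked`, `reachable`, `robustly_reachable`) are illustrative and do not come from the tool.

```python
FILL = 0x61                # byte written by fill()
RET = 0xAA                 # assumed original (toy, one-byte) return address
CANARIES = range(256)      # uncontrolled input: toy 8-bit canary
LENGTHS = range(11)        # controlled input n: 0..10 bytes written


def hijacked(n, canary, ssp=True):
    """True if victim() returns to an overwritten address in the toy model."""
    # Toy stack, low to high addresses: buffer[0..7], canary, return address.
    stack = [0] * 8 + [canary, RET]
    for i in range(n):
        stack[i] = FILL
    ret_overwritten = stack[9] != RET
    # With SSP, the function only returns if the canary is intact.
    check_passes = (not ssp) or stack[8] == canary
    return ret_overwritten and check_passes


def reachable(ssp):
    """Standard reachability: some (n, canary) pair triggers the hijack."""
    return any(hijacked(n, c, ssp) for n in LENGTHS for c in CANARIES)


def robustly_reachable(ssp):
    """Robust reachability: some n triggers the hijack for every canary."""
    return any(all(hijacked(n, c, ssp) for c in CANARIES) for n in LENGTHS)


if __name__ == "__main__":
    for ssp in (False, True):
        print("SSP" if ssp else "No SSP", reachable(ssp), robustly_reachable(ssp))
```

In this model the hijack is reachable in the standard sense with and without SSP (with SSP, only when the canary happens to equal 0x61), while it is robustly reachable only when SSP is off.
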

This phenomenon is not restricted to stack protectors. We identify in Table 2
several situations where standard reachability may lead to false positives, unlike 
robust reachability. Note that some cases (randomisation based protections, 
uninitialized reads) concern binary-level issues, and cannot be observed from 
a source-level analysis. 


Discussion. Consider the slightly different problem of reaching line 11 in Fig. 1b. 
It is reachable for all values of the canary except 0x61616161, hence it is not 
considered robustly reachable — all values of uncontrolled input should lead to 
line 11. This restriction is deliberate. A more quantitative approach would hinder 
automation. For similar reasons, we limit ourselves to non-interactive scenarios, 
where the attacker input is chosen before uncontrolled inputs are known. We will
further motivate these choices in Sects. 4.1 and 6.4. 

Despite these deliberate restrictions, our case studies (Sect. 6.2) show the
versatility of robust reachability. In the example above, we distinguish inputs
controlled by an attacker (a bad guy) from inputs which he cannot influence
(see also e.g. libvncserver in Sect. 6.2). But with doas (Sect. 6.2), we distinguish
inputs controlled by the system administrator (the good guy) from those which
vary on each execution. Other situations are possible, for instance deterministic
inputs versus non-deterministic ones, as in the case of flaky tests [42], where
there are neither good nor bad guys. Robust reachability can help in all these
situations, by either assessing the “quality” of a given trigger or test suite (criticality,
replicability), generating “good” triggers or test suites, or proving their absence.


Table 2. Program constructs for which standard reachability yields fragile input 


Randomisation-based protections | Standard reachability models randomized or arbitrary values like canaries or ASLR as attacker-chosen values. This voids such protections. See also Fig. 1 and libvncserver in Sect. 6.2

Uninitialized reads | With standard reachability, the attacker can choose the initial content of uninitialized memory. For example he can choose it to contain a password or a secret. See also doas in Sect. 6.2

Underspecified initial state | A bug which is unreachable in normal operating conditions can become reachable if, e.g., one leaves the stack location completely free. Then the bug only happens with pathological initial state

Undefined behavior | A bug in a branch depending on undefined behavior is still technically reachable, but not robustly reachable. Note that even machine code has some undefined behaviors

Interactions with the environment | Contrary to robust reachability, standard reachability lets the attacker use system calls and interactions by e.g. letting him choose the date to nanosecond precision, as if the environment helped him

Opaque functions | One can abstract complex functions (crypto functions, malloc) as black boxes returning a fresh, symbolic value. Standard reachability allows the attacker to choose these values, yielding fragile triggers


3 Background 


Consider a program $P$ and $S$ the set of its possible states. Each state $s \in S$ is
labeled by a program location $\lambda(s) \in \mathcal{L}$. Execution of the program is represented
by a (one-step) successor relation ${\to} \subseteq S \times S$; its transitive reflexive closure is
denoted by $\to^*$. For a finite trace $t \in S^*$ and $s, s' \in S$ two states, we write
$s \xrightarrow{t} s'$ if $t$ starts with $s$, ends with $s'$ and follows $\to$. The initial state $s_0(y)$
depends on the program input $y$. For a location $\ell \in \mathcal{L}$ and input $y$ we write $y \vdash \ell$
if $s_0(y) \to^* s$ where $\lambda(s) = \ell$. Additionally, for a trace $t \in S^*$, we write $y \vdash_t \ell$ if
$s_0(y) \xrightarrow{t} s$ where $\lambda(s) = \ell$. We use trace for successions of states and path for
successions of locations. By abuse of notation, the path corresponding to a trace
$t \in S^*$ is $\lambda(t) \in \mathcal{L}^*$. For a path $\pi$, we denote its length $|\pi|$ and we write $y \vdash \pi$ if
$\exists t \in S^*.\ \lambda(t) = \pi \land y \vdash_t \ell$ where $\ell$ is the final location of $\pi$.


Definition 1 (standard reachability). Given a program $P$, a location $\ell \in \mathcal{L}$
is reachable if $\exists y.\ y \vdash \ell$.


It is often useful to consider the case of reaching a location $\ell$ with a state $s$
satisfying some predicate $\varphi$. This can be reduced to standard reachability by
adding if ($\varphi$) /*new target*/ at the target location.


Definition 2 (correctness, completeness). Let $V : (P, \ell) \mapsto \{1, 0\}$ be a verifier
taking as input a program $P$ and a location $\ell$:


- $V$ is correct when for all $P$, $\ell$, if $V(P, \ell) = 1$ then $\ell$ is reachable in $P$;

- $V$ is complete when for all $P$, $\ell$, if $\ell$ is reachable then $V(P, \ell) = 1$;

- If $V$ also takes an integer bound as input, $V$ is k-complete when for all integers
$k$ and $P$, $\ell$, if $\exists y.\ \exists t \in S^*.\ |t| \le k \land y \vdash_t \ell$ then $V(P, \ell, k) = 1$.


In general, verifying reachability is undecidable, so verifiers cannot be both cor- 
rect and complete. Correct verifiers can still be k-complete as k-completeness 
can be thought of as completeness for finite-path systems. 


    Data: bound $k$, target $\ell$
    for path $\pi$ in GetPaths($k$) do
        if $\pi$ goes through $\ell$ then
            $\varphi$ := GetPredicate($\pi$)
            if $\exists y.\ \varphi$ is satisfiable then
                return true
    end
    return false

(a) SE

    Data: bound $k$, target $\ell$
    $\varphi$ := $\bot$
    for path $\pi$ in GetPaths($k$) do
        if $\pi$ goes through $\ell$ then
            $\varphi$ := $\varphi \lor$ GetPredicate($\pi$)
    end
    if $\exists y.\ \varphi$ is satisfiable then return true
    else return false

(b) BMC


Fig. 2. Reachability of $\ell$ with SE and BMC


Symbolic Execution (SE) and Bounded Model Checking (BMC). SE [9]
incrementally explores all paths in the program (up to, say, a bound $k$) and
when an explored path reaches the target location $\ell$, checks that this path is
indeed executable. This is performed by converting a path $\pi$ to an SMT formula
$pc_\pi$, called path constraint, which has input $y$ as its only free variable and is
equivalent to $y \vdash \pi$, i.e., a path is executable if and only if its path constraint is
satisfiable. In contrast, BMC [13] considers the program as a whole and builds
an SMT formula expressing that one of the paths of length at most $k$ leads to
$\ell$. It is equivalent to the disjunction of the path constraints of these paths. The
target is reachable in $k$ steps at most if and only if this formula is satisfiable.
These algorithms are detailed in Fig. 2, where GetPredicate turns a path
into its path constraint and GetPaths($k$) yields all paths below size bound $k$.
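
The two algorithms can be sketched as follows on a hypothetical three-path program. This is only an illustration under strong assumptions: paths are listed by hand, path constraints are plain Python predicates over a single integer input, and a brute-force search over a small finite domain stands in for the SMT solver. The names `PATHS`, `get_paths`, `satisfiable`, `se` and `bmc` are made up for the example.

```python
from typing import Callable, List, Tuple

# A path is a sequence of locations plus its path constraint over the input y.
Path = Tuple[Tuple[str, ...], Callable[[int], bool]]

INPUTS = range(16)  # finite input domain standing in for a solver's search space

# Hypothetical program: if (y > 10) { if (y % 2 == 0) bug; } exit
PATHS: List[Path] = [
    (("entry", "then", "bug"), lambda y: y > 10 and y % 2 == 0),
    (("entry", "then", "exit"), lambda y: y > 10 and y % 2 != 0),
    (("entry", "exit"), lambda y: y <= 10),
]


def get_paths(k: int) -> List[Path]:
    """All paths of length at most k (stand-in for GetPaths)."""
    return [p for p in PATHS if len(p[0]) <= k]


def satisfiable(phi: Callable[[int], bool]) -> bool:
    """Brute-force stand-in for an SMT query 'exists y. phi'."""
    return any(phi(y) for y in INPUTS)


def se(k: int, target: str) -> bool:
    """SE: one satisfiability query per path going through the target."""
    for locations, pc in get_paths(k):
        if target in locations and satisfiable(pc):
            return True
    return False


def bmc(k: int, target: str) -> bool:
    """BMC: a single query on the disjunction of all relevant path constraints."""
    constraints = [pc for locations, pc in get_paths(k) if target in locations]
    return satisfiable(lambda y: any(pc(y) for pc in constraints))


if __name__ == "__main__":
    print(se(3, "bug"), bmc(3, "bug"))  # both find the bug within bound 3
    print(se(2, "bug"), bmc(2, "bug"))  # bound 2 is too small for either
```

On this example both functions give the same answer for every bound and target, which matches the intuition that the two techniques differ only in how the queries are grouped.
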


Proposition 1. SE and BMC have the same expressive power: both are correct 
and k-complete. 


Interestingly, we show in Sect. 5 this is not true anymore with robust reachability. 


Solvers. SE and BMC commonly discharge their satisfiability queries to SMT 
solvers [4] which take formulas as input, and output whether they are satisfiable 
(along with a model) or not. Typical queries are expressed in the quantifier-free 
fragments of well known theories (linear integer arithmetic, bitvectors, arrays, 
etc.) where SMT solvers perform well in practice. In case of an undecidable 
theory, we can use incomplete solvers (possibly answering UNKNOWN), at the 
price of k-completeness. 


4 Robust Reachability 


4.1 Definition 


We introduce the new notion of robust reachability. We partition the input $y$ into
the controlled input $a$ and the uncontrolled input $x$; we denote $y \triangleq (a, x)$. Let
$\mathcal{A}$ and $\mathcal{X}$ be the sets of possible controlled and uncontrolled inputs respectively.
A location is robustly reachable when the attacker can choose controlled input
$a \in \mathcal{A}$ without having to rely on specific values of the uncontrolled input $x \in \mathcal{X}$
to reach his target. Input $a$ is then called a robust trigger; otherwise it is a
fragile trigger.


Definition 3 (Robust reachability). A location $\ell \in \mathcal{L}$ is robustly reachable
if $\exists a.\ \forall x.\ (a, x) \vdash \ell$. This definition depends on the partition of inputs.


Proposition 2. Robust reachability implies standard reachability. The converse 
implication does not hold. 
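
Definition 3 and Proposition 2 can be checked by exhaustive enumeration on small finite input sets. In the sketch below, a program is abstracted as a hypothetical predicate `prog(a, x)` that is true when the target is reached; the two example programs `lucky` and `solid` are invented for illustration.

```python
A = range(4)  # controlled inputs (toy domain)
X = range(4)  # uncontrolled inputs (toy domain)


def reachable(prog):
    """Standard reachability: exists (a, x) reaching the target."""
    return any(prog(a, x) for a in A for x in X)


def robust_triggers(prog):
    """Controlled inputs a such that the target is reached for all x."""
    return [a for a in A if all(prog(a, x) for x in X)]


def robustly_reachable(prog):
    """Robust reachability: exists a, for all x, the target is reached."""
    return bool(robust_triggers(prog))


def lucky(a, x):
    # Reached only when the attacker guesses the uncontrolled value.
    return a == x


def solid(a, x):
    # a = 3 always works; other values of a only work when x = 0.
    return a == 3 or x == 0


if __name__ == "__main__":
    print(reachable(lucky), robustly_reachable(lucky))
    print(reachable(solid), robust_triggers(solid))
```

Here `lucky` is reachable but every trigger is fragile, which is a counterexample to the converse implication, while `solid` has a robust trigger.
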


Discussion. As already mentioned at the end of Sect. 2, our definition of robust 
reachability specifically targets a threat model where the attacker speaks first, 
unaware of uncontrolled inputs. It deliberately excludes interactive systems 
where the attacker can choose some input, then receive some program output 
possibly leaking uncontrolled input, and then choose some more input depend- 
ing on what was received. Modeling such situations requires additional quantifier 
alternations, which deeply impact the performance of proof methods and cripple
automation, as shown in Sect. 6.4. 

Likewise, a bug triggered for all uncontrolled inputs but one is not robustly 
reachable according to Definition 3. A quantitative definition of robust reach- 
ability could take into account the proportion of uncontrolled inputs triggering 
a bug. This relates to work on model counting [11,39], but the problem at
hand is actually harder. Consider the following alternative definition: (i) find
$a_{\max} \in \mathcal{A}$ such that a maximal proportion of uncontrolled inputs $x$ lead to
$\ell$, i.e., $(a_{\max}, x) \vdash \ell$; (ii) measure how robustly $\ell$ can be reached by computing
the proportion of uncontrolled inputs $x$ such that $(a_{\max}, x) \vdash \ell$. Current model
counting algorithms can only tackle problem (ii) along one path, and we argue in
Sect. 6.4 that even (ii) alone is considerably more expensive than our SMT-based 
approach. 

In other words, Definition 3 is a tradeoff to keep robust reachability amenable
to automated verification. This does not prevent it from meeting its main goal:
drawing attention to more serious bugs. Some may of course be missed, but,
as our case studies will show (Sect. 6), a good number will be found.

In the rest of this section, we review a few related properties and see how
much they overlap with, but do not remove the need for, robust reachability.


4.2 Relation with Non-interference 


We partition inputs and outputs of a system into either high (highly classified) 
or low (public, e.g. observable). A system satisfies non-interference [31] when low 
outputs do not depend on high inputs, implying that secrets cannot leak. Robust 
reachability can be reformulated in a very non-interference-sounding phrasing: 
uncontrolled inputs (call them high) must not interfere with the attacker reaching 
the target location (the low output). Let us clarify this link. 

Formally, let high input be uncontrolled input $x$, and low input be controlled
input $a$. Let low output be whether control flow reached location $\ell$.
Non-interference of the resulting system means that
$\forall a, x, x'.\ ((a, x) \vdash \ell \iff (a, x') \vdash \ell)$.


Proposition 3. If $\ell$ is (standardly) reachable and the system satisfies
non-interference with the high/low partition described above, then $\ell$ is
robustly reachable. The converse is false.


Robust reachability requires a single value of the controlled input $a$ for
which reachability of $\ell$ is guaranteed but says nothing for other values of
$a$, whereas non-interference constrains the system to behave much more
independently of uncontrolled input than robust reachability but says nothing of
reachability.
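
The relation stated in Proposition 3 can be checked on toy examples by
enumeration. The following is a minimal sketch under assumed finite domains;
the programs `independent`, `mixed` and `never` are hypothetical and only serve
to exhibit each case.

```python
A = range(3)  # controlled (low) inputs, hypothetical domain
X = range(3)  # uncontrolled (high) inputs, hypothetical domain


def reachable(reaches):
    # exists a, x. (a, x) |- l
    return any(reaches(a, x) for a in A for x in X)


def robustly_reachable(reaches):
    # exists a. forall x. (a, x) |- l
    return any(all(reaches(a, x) for x in X) for a in A)


def non_interferent(reaches):
    # forall a, x, x'. ((a, x) |- l  <=>  (a, x') |- l)
    return all(reaches(a, x) == reaches(a, x2)
               for a in A for x in X for x2 in X)


def independent(a, x):
    """Reaching the target never depends on x."""
    return a == 2


def mixed(a, x):
    """a == 0 is a robust trigger, but a == 1 reaches the target only if x == 0."""
    return a == 0 or (a == 1 and x == 0)


def never(a, x):
    """The target is unreachable, so non-interference holds vacuously."""
    return False
```

Here `independent` is reachable and non-interferent, hence robustly reachable;
`mixed` is robustly reachable without being non-interferent (the converse
fails); and `never` shows that non-interference alone says nothing about
reachability.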


4.3 Interpretation in Terms of Hyperproperty 


Robust reachability and its negation are not trace properties: the observation
of a single trace is never enough to prove or disprove them. For example,
observing a single trace reaching target $\ell$ with input $(a, x)$ is
compatible both with $\ell$ being robustly reachable (if all other inputs
$(a, x')$, $x' \in \mathcal{X}$, also reach $\ell$), and with $\ell$ not being
robustly reachable (if some other $x'$ is such that $(a, x')$ does not reach
$\ell$). Robust reachability and its negation thus belong to the more general
class of hyperproperties [16], i.e. statements relating several traces.

More specifically, Clarkson et al. [16] show that any hyperproperty is the
intersection of a hypersafety hyperproperty (i.e. something bad cannot happen)
and a hyperliveness hyperproperty (something good will eventually happen).
Hypersafety is generally thought of as easier to prove, notably with
self-composition [6]. Unfortunately, robust reachability and its negation are
pure hyperliveness in the general case: no finite set of finite traces can
falsify them. However, in some conditions, they degenerate partly into
hypersafety:


Proposition 4. If the domain $\mathcal{X}$ of uncontrolled inputs is finite,
then the negation of robust reachability is not pure hyperliveness (i.e., it
has a non-trivial hypersafety component).


Proof. Robust reachability of $\ell$ can be proved by finding controlled input
$a \in \mathcal{A}$ such that for all uncontrolled input $x \in \mathcal{X}$ one
observes a trace starting with input $(a, x)$ and reaching $\ell$. When
$\mathcal{X}$ is finite, this means that a finite observation can disprove
non-(robust reachability). This is the definition of hypersafety.


This idea—trying to observe a hopefully small set of traces which together prove 
robust reachability—is crucial for algorithms and leads to our use of path merg- 
ing in Sect. 5.3. 


4.4 Interpretation in Terms of Temporal Logic 


Computation Tree Logic (CTL). CTL [14] is a temporal logic over the tree
of possible traces. Let $L$ be a labeling which maps states to the set of
(atomic) predicates they satisfy. If $\ell$ is a predicate, the CTL formula
$\ell$ is satisfied by all systems whose initial state $s_0$ verifies
$\ell \in L(s_0)$. If $\phi$ is a CTL formula and $s$ a state, then
$\mathrm{EX}\,\phi$ expresses that $\phi$ holds in at least one (direct)
successor of $s$, and $\mathrm{AF}\,\phi$ that all traces arising from $s$
eventually reach a state from which $\phi$ holds. CTL introduces other
operators, not needed here.


Proposition 5. It is possible to express robust reachability with CTL. 


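Proof. Let $S' \triangleq S \cup \mathcal{A} \cup \{s_i\}$ where $s_i$ is a new
state, let $\to' \triangleq \to \cup \{(s_i, a) \mid a \in \mathcal{A}\} \cup
\{(a, s_0(a, x)) \mid a \in \mathcal{A}, x \in \mathcal{X}\}$, and let $L'(s)$
be equal to $L(s)$ if $s \in S$ and $\emptyset$ otherwise. Then $\ell$ is
robustly reachable if, and only if, $\mathrm{EX}\,\mathrm{AF}\,\ell$ is true in
the new extended system $(S', \to', L')$ with $s_i$ as initial state.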
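
The construction used in this proof can be sketched on a small explicit-state
system. The code below is an illustration only: the state names, the `bug` and
`exit` sink states with self-loops (added to keep the transition relation
total), and the fixpoint helpers are assumptions made for this example, not an
existing model checker.

```python
def af(states, succ, target):
    """Least fixpoint computing the states that satisfy AF target."""
    sat = {s for s in states if s in target}
    changed = True
    while changed:
        changed = False
        for s in states:
            if s not in sat and succ[s] and all(t in sat for t in succ[s]):
                sat.add(s)
                changed = True
    return sat


def ex(states, succ, sat):
    """States with at least one direct successor in sat (EX)."""
    return {s for s in states if any(t in sat for t in succ[s])}


def extended_system(A, X, reaches):
    """New initial state s_i, one state per controlled input a, then s0(a, x)."""
    succ = {"s_i": [("a", a) for a in A], "bug": ["bug"], "exit": ["exit"]}
    for a in A:
        succ[("a", a)] = [("s0", a, x) for x in X]
        for x in X:
            succ[("s0", a, x)] = ["bug" if reaches(a, x) else "exit"]
    return set(succ), succ


def robustly_reachable_ctl(A, X, reaches):
    # EX AF bug evaluated in the extended system, from s_i
    states, succ = extended_system(A, X, reaches)
    return "s_i" in ex(states, succ, af(states, succ, {"bug"}))
```

The first transition out of `s_i` plays the role of the existential choice of
`a`, and `AF` then forces every choice of `x` to reach the target.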


HyperLTL. It is also possible to express robust reachability in the temporal
logic HyperLTL [15], which allows reasoning over sets of traces $\pi$, assuming
we have an atomic predicate $=_v$ stating that the first states of two traces
have the same value for variable $v$. Robust reachability of $\ell$ can then be
expressed as $\exists \pi.\ \forall \pi'.\ F\ell_{\pi} \wedge (\pi =_a \pi'
\Rightarrow F\ell_{\pi'})$, where $F\ell_{\pi}$ denotes that trace $\pi$ goes
through $\ell$. In other words, there exists a trace $\pi$ reaching $\ell$ s.t.
all traces sharing the same controlled input also reach $\ell$.


4.5 Robust Reachability and Automatic Verification


The previous classification does not help us find an efficient software verification 
method for robust reachability. Indeed, while efficient CTL model checkers exist
for the finite case [12] or very specific formalisms such as pushdown systems 
[47], most efforts in (general) software verification have been directed towards 
the verification of safety temporal formulas or simple termination [17] (formulas 
of the form AF). Moreover, temporal logics like HyperLTL [15] suffer the same 
limitations, and checking for both reachability and non-interference is probably 
too strong a requirement in practice. Finally, one can prove the absence of robust 
reachability by proving the absence of standard reachability. It is thus possible to 
use existing algorithms for unreachability, based e.g. on invariant computation,
at the price of even larger over-approximation than when they are used for their 
original purpose. This kind of approach is not our focus. In this paper we look 
for correct verifiers able to prove robust reachability (and report robust triggers) 
rather than to disprove it. 


5 Automatically Proving Robust Reachability 


We now extend SE and BMC to the robust case. 


5.1 Robust Bounded Model Checking 


As mentioned in Sect. 3, BMC determines the reachability of a location $\ell$ by
building a family of SMT formulas $\varphi_k(a, x)$ equivalent to
$\exists t \in S^*.\ |t| < k \wedge (a, x) \vdash_t \ell$. $\varphi_k$ expresses
that $\ell$ is reachable in fewer than $k$ steps. Then one proves that $\ell$ is
reachable if and only if $\exists k.\ \exists a.\ \exists x.\ \varphi_k(a, x)$.
This extends to robust reachability:


Proposition 6. If the domain of uncontrolled input $\mathcal{X}$ is finite or
the system has finitely many paths, then $\ell$ is robustly reachable if and
only if $\exists k.\ \exists a.\ \forall x.\ \varphi_k(a, x)$.


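Proof. ($\Leftarrow$) comes directly from the definition of $\varphi_k$.
($\Rightarrow$) If $\ell$ is robustly reachable, let $a_0$ be a robust trigger.
The set of paths $P$ arising from inputs in $\{a_0\} \times \mathcal{X}$ is
finite (bounded either by the size of $\mathcal{X}$ or by the number of paths in
the system), and $\forall x.\ \bigvee_{\pi \in P} pc_\pi(a_0, x)$ holds. Let
$k = 1 + \max_{\pi \in P} |\pi|$. All paths in $P$ are unrolled in $\varphi_k$,
so $\bigvee_{\pi \in P} pc_\pi(a_0, x) \Rightarrow \varphi_k(a_0, x)$ and thus
$\forall x.\ \varphi_k(a_0, x)$.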


As a result, it is enough to replace the condition “$\exists y.\ \phi$ is
satisfiable” by “$\exists a.\ \forall x.\ \phi$ is satisfiable” in Fig. 2b.


Corollary 1. The resulting algorithm, robust BMC, is correct w.r.t. robust
reachability. If the domain of uncontrolled input $\mathcal{X}$ is finite or the
system has finitely many paths, then robust BMC is also k-complete.
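
The effect of replacing the existential query by an exists-forall query can be
sketched on a toy program whose loop count is the uncontrolled input. This is a
brute-force illustration, not an SMT-based implementation: `trace_length`,
`phi`, `standard_bmc` and `robust_bmc` are hypothetical names, and the
quantifiers are resolved by enumeration over assumed finite domains.

```python
def trace_length(a, x):
    """Toy program: loop x times, then enter the target iff a == 0.

    Returns the number of states of the trace that reaches the target,
    or None when the target is not reached."""
    length = 1            # initial state
    for _ in range(x):    # uncontrolled number of loop iterations
        length += 1
    return length + 1 if a == 0 else None


def phi(k, a, x):
    # phi_k(a, x): some trace t with |t| < k reaches the target
    n = trace_length(a, x)
    return n is not None and n < k


def standard_bmc(A, X, max_k):
    """Smallest k <= max_k with: exists a, x. phi_k(a, x), else None."""
    for k in range(1, max_k + 1):
        if any(phi(k, a, x) for a in A for x in X):
            return k
    return None


def robust_bmc(A, X, max_k):
    """Smallest k <= max_k with: exists a. forall x. phi_k(a, x), else None."""
    for k in range(1, max_k + 1):
        if any(all(phi(k, a, x) for x in X) for a in A):
            return k
    return None
```

In this toy setting the robust query only succeeds once the bound covers the
longest trace over all values of `x`, so a larger domain for `x` requires a
larger bound, and a bound that is too small yields no answer.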


The finiteness hypothesis is required: if a program reaches a location after 
having executed a loop an unbounded, uncontrolled number of times, then robust 
BMC has to unroll an unbounded number of paths to prove robust reachability.


5.2 Robust Symbolic Execution


Similarly to BMC, we check that a path $\pi$ robustly reaches the target by
checking the satisfiability of $\exists a.\ \forall x.\ pc_\pi(a, x)$, instead
of $\exists a.\ \exists x.\ pc_\pi(a, x)$. This means replacing
“$\exists y.\ \phi$ is satisfiable” by “$\exists a.\ \forall x.\ \phi$ is
satisfiable” in Fig. 2a. Unfortunately the resulting algorithm, robust SE, is
not exactly what we want, as it proves a stronger property.


Definition 4 (Single-path robust reachability). A location
$\ell \in \mathcal{L}$ is single-path robustly reachable if
$\exists \pi \in \mathcal{L}^*.\ \exists a.\ \forall x.\ \exists t \in S^*.\
\lambda(t) = \pi \wedge (a, x) \vdash_t \ell$. In other words, the path used to
reach $\ell$ is the same regardless of the uncontrolled input.


Proposition 7. Single-path robust reachability implies robust reachability. The 
converse implication does not hold. 


Proposition 8. Robust SE is correct and k-complete w.r.t. single-path robust
reachability.


Proof. By construction, $pc_\pi(a, x)$ is equivalent to $(a, x) \vdash \pi$, so
$\exists \pi.\ \exists a.\ \forall x.\ pc_\pi(a, x)$ is equivalent to
single-path robust reachability of the last location of $\pi$.


Corollary 2. Robust SE is correct but incomplete for robust reachability. 


Interestingly, the expressive powers of SE and BMC, which are the same for 
standard reachability, diverge when extended to robust reachability. 


5.3 Path Merging 


Path merging [33] (a.k.a. state joining) consists in identifying “close” paths
leading to the same location and replacing them by a merged path (summary).
With original path constraints $pc_{\pi_1}$ and $pc_{\pi_2}$, the merged path
constraint is $pc_{\pi_1} \vee pc_{\pi_2}$. This is only an optimization in the
standard setting, with no impact on k-completeness. The situation is different
in the robust setting.


Data: bound $k$, target $\ell$
$\phi := \bot$
for path $\pi$ in GetPaths($k$) do
    if $\pi$ goes through $\ell$ then
        $\phi := \phi \vee$ GetPredicate($\pi$)
        if $\exists a.\ \forall x.\ \phi$ is satisfiable then
            return true
end
return false

Algorithm 1: RSE+: Robust SE with systematic path merging


void main(a, x) {
  if (x) x++;   // $\pi_1$
  else x--;     // $\pi_2$

  if (!a) bug();
}

Fig. 3. An example where path merging is required


Consider the program in Fig. 3: the bug is robustly reachable with controlled
input $a = 0$, but the control flow takes one of two paths $\pi_1$ and $\pi_2$
depending on the value $x$ of uncontrolled input. This bug will not be found by
robust SE as defined previously, as neither $\pi_1$ nor $\pi_2$ fulfills the
satisfiability criterion $\exists a.\ \forall x.\ pc_{\pi_i}(a, x)$. However, if
$\pi_1$ and $\pi_2$ are merged, then the bug is found because
$\exists a.\ \forall x.\ pc_{\pi_1}(a, x) \vee pc_{\pi_2}(a, x)$ is satisfiable.
This leads us to robust SE with systematic path merging (RSE+, Algorithm 1),
better fit to robust reachability.
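
The behaviour described for Fig. 3 can be reproduced by enumeration. The sketch
below hand-writes the two path constraints of that program and compares
per-path validation with the accumulated disjunction of Algorithm 1; the finite
domains and the function names `rse` and `rse_plus` are assumptions made for
illustration.

```python
A = range(2)       # controlled a (assumed 1-bit for this sketch)
X = range(-2, 3)   # uncontrolled x (assumed small domain)


def pc_pi1(a, x):
    """Path taking the then-branch (x != 0) and then calling bug()."""
    return x != 0 and a == 0


def pc_pi2(a, x):
    """Path taking the else-branch (x == 0) and then calling bug()."""
    return x == 0 and a == 0


def exists_forall(pred):
    # exists a. forall x. pred(a, x)
    return any(all(pred(a, x) for x in X) for a in A)


def rse(path_constraints):
    """Robust SE without merging: each path is validated on its own."""
    return any(exists_forall(pc) for pc in path_constraints)


def rse_plus(path_constraints):
    """Systematic path merging: validate the disjunction of paths seen so far."""
    seen = []
    for pc in path_constraints:
        seen.append(pc)
        if exists_forall(lambda a, x: any(p(a, x) for p in seen)):
            return True
    return False
```

Neither path alone passes the exists-forall test because each one constrains
`x`, while their disjunction only constrains `a`.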


Proposition 9. Robust SE with systematic path merging (RSE+) is correct for
robust reachability. If the domain of uncontrolled input $\mathcal{X}$ is finite
or the system has finitely many paths, then it is also k-complete.


Proof. For k-completeness: If $\ell$ is robustly reachable, let $a_0$ be a
robust trigger. The set of paths $P$ arising from inputs in
$\{a_0\} \times \mathcal{X}$ is finite (bounded either by $\mathcal{X}$ or the
number of paths in the system). Let $k = 1 + \max_{\pi \in P} |\pi|$. For bound
$k$, when GetPaths has output all paths in $P$,
$\bigvee_{\pi \in P} pc_\pi \Rightarrow \phi$, so
$\exists a.\ \forall x.\ \phi$ is satisfiable.


In conclusion, path merging improves the completeness of robust SE. This is 
surprising because path merging is merely optional in standard SE.


5.4 Revisiting Standard Optimizations and Constructs


Some optimizations commonly used in SE are no longer correct or complete
in a robust setting. We show here how to adapt them.


Data: program entrypoint $\ell_0$, bound $k$
1 $P := \{\ell_0\}$
2 while $P \neq \emptyset$ do
3     Take a path $\pi$ out of $P$
4     if $|\pi| > k$ then continue
5     if $\exists a, x.\ pc_\pi$ unsat then continue
6     yield $\pi$
7     $P := P \cup \{$children paths of $\pi\}$
8 end

Algorithm 2: Implementation of GetPaths with path pruning


uncontrolled int x;
if (x<10) { /* a */ }
else { /* b */ }
/* c */
if (x>20) {
  /* d */
  if (x>30) { /* e */ }
  else { /* f */ }
}

Fig. 4. Failure case for universal path pruning


Incremental Path Pruning [3,48]. When a path has an unsatisfiable path
constraint, all its descendant paths are also infeasible. For example, the path
acd in Fig. 4 has path constraint $x < 10 \wedge x > 20$, which is
unsatisfiable. One can prune this path, i.e. stop exploring it and its children
acde and acdf.


Data: entrypoint $\ell_0$, bound $k$
$P := \{\ell_0\}$
while $P \neq \emptyset$ do
    Take a path $\pi$ out of $P$
    if $|\pi| > k$ then continue
    if $\exists a.\ \forall x.\ pc_\pi$ unsat then
        /* Skip MaybeMerge to disable path merging */
        $P :=$ MaybeMerge($\pi$, $P$)
        continue
    end
    yield $\pi$
    $P := P \cup \{$children paths of $\pi\}$
end

Algorithm 3: GetPaths with universal path pruning


1 Function MaybeMerge($\pi$, $P$)
2     Choose $u$ a transitive child of the last location of $\pi$ (ideally, a
      strict postdominator of the second to last location of $\pi$)
3     Let $\pi'$ be the longest strict prefix of $\pi$.
4     Let $U$ be the set of paths from $\pi'$ to $u$
5     if $\exists a.\ \forall x.\ \bigvee_{\pi'' \in U} pc_{\pi''}$ is SAT then
6         Merge paths in $U$ and add the result to $P$
7     end
8     return $P$

Algorithm 4: Incomplete path merging for universal path pruning


In Fig. 2a this would be an optimization of GetPaths: as shown in Algorithm
2, one checks that the path constraints of currently explored paths are
satisfiable, and if not, the paths at fault are pruned, and their children paths
are not explored. As a result, we now issue satisfiability queries on two
occasions: during GetPaths to prune paths (Algorithm 2, line 5), and when
validating a candidate reaching path (Fig. 2a, line 5). Pruning queries and
validation queries must be treated differently.

Robust SE is obtained from SE by adding a universal quantifier to validation
queries but not pruning queries. The path constraint for path a in Fig. 4
is $pc_a = x < 10$ but $\exists a.\ \forall x.\ pc_a$ is false. The same applies
to b. If we added a universal quantifier to pruning queries (which we call
universal path pruning, see Algorithm 3), we would prune a and b, and
incorrectly conclude that c is not robustly reachable. In other words, Symbolic
Execution with universal path pruning (denoted RSE$_\forall$) is correct but not
complete.
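
The failure case of Fig. 4 can be replayed with a small explicit exploration.
This is a simplified sketch: locations are the letters used in Fig. 4, the
branch that skips the second `if` is not modelled, `x` ranges over an assumed
finite domain, and since the example has no controlled input the exists-forall
pruning query reduces to a universal one.

```python
X = range(0, 41)  # hypothetical finite domain for the uncontrolled x

# Branch condition guarding each location of Fig. 4.
COND = {
    "a": lambda x: x < 10, "b": lambda x: x >= 10, "c": lambda x: True,
    "d": lambda x: x > 20, "e": lambda x: x > 30, "f": lambda x: x <= 30,
}
# Successor locations; "" stands for the program entry point.
CHILDREN = {"": "ab", "a": "c", "b": "c", "c": "d", "d": "ef", "e": "", "f": ""}


def pc(path, x):
    """Path constraint: conjunction of the conditions along the path."""
    return all(COND[loc](x) for loc in path)


def get_paths(universal):
    """Explore paths, pruning with `forall x` (universal) or `exists x`."""
    quantifier = all if universal else any
    todo, kept = [""], []
    while todo:
        path = todo.pop()
        if not quantifier(pc(path, x) for x in X):
            continue  # prune this path and all its children
        kept.append(path)
        last = path[-1] if path else ""
        todo.extend(path + loc for loc in CHILDREN[last])
    return kept
```

Existential pruning keeps `ac` and `bc` and only drops the infeasible `acd`,
whereas universal pruning drops `a` and `b` immediately and never sees `c`,
even though the disjunction of `ac` and `bc` holds for every `x`.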

Universal path pruning, however, conveys an interesting intuition: the full
if branch below acd in Fig. 4 is not robustly reachable, because
$\forall x.\ x > 20$ is false. With normal path pruning and RSE+, we would
needlessly explore these paths. To take advantage of this, we keep
RSE$_\forall$ but improve its completeness with path merging, as depicted in
Algorithm 4.

The main idea is that when a set of paths are to be pruned, they may pass
the universal pruning test $\exists a.\ \forall x.\ pc$ when merged together.
One way to find such sets of paths is to use the Control Flow Graph (CFG) of the
program. For example when trying to prune $\pi = a$ in Fig. 4, we know by
invariant of the set $P$ of paths to be explored that $\pi' = \epsilon$, the
empty path, passes the universal test. We compute the strict postdominator
$u = c$ of $\pi'$: when the paths from $\pi'$ to c join again, they pass the
pruning test again. We then replace $\pi$ by this merged path in the set $P$ of
paths to be explored.

Note that computing a postdominator is not required for correctness. In our
implementation, we cannot compute the exact CFG at the binary level so the
chosen $u$ may be wrong. In line 5 of Algorithm 4 we check that we picked
correctly, and otherwise, merging failed and we prune $\pi$. Despite the
heuristic approach, the technique proves useful, as we will see in Sect. 6.

We denote Robust SE with universal path pruning and path merging as
RSE$_\forall$+. It is correct and less incomplete than RSE$_\forall$.


Assumptions. It is common to model complex parts of the system by introducing
their result as a symbolic input $z$ and then assume that $z$ satisfies the
required properties. For example, Address Space Layout Randomisation (ASLR) for
the stack pointer could be modeled by adding an assumption that
$esp \in [m, M]$ where $m$ and $M$ are inlined constant values. In standard SE
this would be translated to an assertion $esp_0 \in [m, M]$ conjoined to the
path constraint $pc_\pi$, where $esp_0$ is the initial value of $esp$. Actually,
in standard SE and BMC, assertions and assumptions are dealt with identically.


controlled unsigned int a;
uncontrolled unsigned int x;
assume(x < a);
if (false) bug();

Fig. 5. Unsound assumption, in pseudo-C.


In a robust setting, on the contrary, adding an assumption $\psi$ to a path
constraint yields $\psi \Rightarrow pc_\pi$, while adding an assertion $\phi$
yields $pc_\pi \wedge \phi$. Additionally, assumptions which mix controlled and
uncontrolled inputs can make the algorithms above unsound without adaptation: in
Fig. 5, reachability of bug maps to the SMT query
$\exists a.\ \forall x.\ x < a \Rightarrow \bot$. It is satisfiable, with
$a = 0$, which makes the premise false. However, this does not correspond to an
executable path. Actually, formalizing robust reachability assuming
$\psi(a, x)$ naively by
$\exists a.\ \forall x.\ (\psi(a, x) \Rightarrow (a, x) \vdash \ell)$ does not
imply standard reachability anymore. A slight adaptation is needed:


Definition 5 (Robust reachability under assumption). A location $\ell$ is
robustly reachable under the assumption of $\psi$ when


$\exists a.\ (\exists x.\ \psi(a, x)) \wedge (\forall x.\ (\psi(a, x) \Rightarrow (a, x) \vdash \ell))$


This definition preserves the implication from robust to standard reachability. 
The algorithms we presented are easily adapted to take it into account. 
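
The difference between the naive formalization and Definition 5 can be checked
on Fig. 5 by enumeration. The sketch below assumes 3-bit unsigned values for
both inputs; `reaches_when_a_is_7` is a hypothetical variant of the figure
added only to show a positive case.

```python
A = range(8)  # controlled a, modelled as a 3-bit unsigned value (assumption)
X = range(8)  # uncontrolled x, same width


def psi(a, x):
    """Assumption of Fig. 5: assume(x < a)."""
    return x < a


def never_reaches(a, x):
    """Fig. 5 guards bug() with `if (false)`, so no input reaches it."""
    return False


def reaches_when_a_is_7(a, x):
    """Hypothetical variant where bug() is guarded by `if (a == 7)`."""
    return a == 7


def naive(reaches):
    # exists a. forall x. (psi(a, x) => (a, x) |- l)
    return any(all(not psi(a, x) or reaches(a, x) for x in X) for a in A)


def under_assumption(reaches):
    # Definition 5: exists a. (exists x. psi(a, x))
    #                         and (forall x. (psi(a, x) => (a, x) |- l))
    return any(any(psi(a, x) for x in X)
               and all(not psi(a, x) or reaches(a, x) for x in X)
               for a in A)
```

The naive query is satisfied by `a = 0`, for which the assumption can never
hold; the extra conjunct of Definition 5 rules out such vacuous triggers.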

Interestingly, in the robust case, SE and BMC cannot handle assertions and 
assumptions in the same way anymore. 


Concretisation and Other Optimizations. When path constraints along a 
path become too complex, some variables can be concretized: their symbolic 
value can be replaced by a concrete one [21,29,45]. Formally, concretizing a 
variable u to value 42 corresponds to adding an assertion u = 42. This sacrifices 
k-completeness for tractability. Actually, any additional constraint can be added, 
and several common optimizations (e.g., domain shrinking, path filtering) can be 
seen through this lens. These optimizations must be applied with care in the
robust setting. First, considering them as assumptions instead of assertions
would be incorrect. Second, if the value of the concretized variable ultimately
depends semantically on uncontrolled input, the path does not pass universal
validation anymore: for example, when concretizing $x$ to 42,
$\exists a.\ \forall x.\ pc(a, x) \wedge x = 42$ is unsatisfiable because
$\forall x.\ x = 42$ is false. As a result, locations visited further on this
path become robustly unreachable. In other words, concretisation only
works on controlled or constant values.
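
The effect of concretisation on the exists-forall query can be seen on a toy
path constraint. This is an illustrative sketch with assumed finite domains;
`pc`, `concretize_x` and `concretize_a` are hypothetical helpers that model
concretisation as an added equality assertion.

```python
A = range(64)  # hypothetical controlled domain
X = range(64)  # hypothetical uncontrolled domain


def pc(a, x):
    """Toy path constraint: the path is taken when a is odd, whatever x is."""
    return a % 2 == 1


def exists_forall(pred):
    # exists a. forall x. pred(a, x)
    return any(all(pred(a, x) for x in X) for a in A)


def concretize_x(pred, value):
    """Add the assertion x == value to the path constraint."""
    return lambda a, x: pred(a, x) and x == value


def concretize_a(pred, value):
    """Add the assertion a == value to the path constraint."""
    return lambda a, x: pred(a, x) and a == value
```

Concretizing the uncontrolled `x` always defeats universal validation, while
concretizing the controlled `a` keeps the query satisfiable only if the chosen
value happens to be a robust trigger, which is the loss of k-completeness
mentioned above.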


5.5 About Constraint Solving 


Adaptations to robust reachability require solvers to deal with one alternation 
of quantifiers. Most theories become undecidable with quantifiers. Dedicated 
algorithms exist for a few decidable quantified theories, e.g. the array property
fragment [7] or Presburger arithmetic [8]. For other theories, generic methods 
like E-matching [40] and MBQI [27] have proven rather efficient, although not 
complete. Sound approximations [25] also have been proposed to reduce
quantified formulas to quantifier-free ones. In our experiments, the newly
introduced quantifier is associated with an increase in the frequency of
time-outs and memory-outs, as seen in Sect. 6.3 and specifically Table 4.


6 Proof-of-Concept of a Robust Symbolic Execution 
Engine 


6.1 Implementation 


We propose BINSEC/RSE, the first symbolic execution engine dedicated to
robust reachability. We base our proof-of-concept on BINSEC [23], a binary
executable formal analysis engine written in OCaml and already used in several
significant case studies [19,20,43]. For the sake of experimental evaluation
(Sect. 6.3) we actually implement five variants of robust reachability: RSE
(basic approach in Sect. 5.2 with existential path pruning, Sect. 5.4), RSE+
(the same plus systematic path merging, Sect. 5.3), RSE$_\forall$ (RSE with
universal path pruning, Algorithm 3), RSE$_\forall$+ (same, with path merging
during path pruning, Algorithm 4), and RBMC (Sect. 5.1). BINSEC/RSE emits
quantified formulas in the theory of bitvectors and arrays (arrays are used to
model memory) which are then solved by the quantified solver Z3 [22]. We reuse
the recent ROW simplification [26] to reduce the number of array indexations.
The source code of BINSEC/RSE, the test suite and the case studies of this
section are available for reproduction at
https://github.com/binsec/cav2021-artifacts and
https://zenodo.org/record/4721753.


6.2 Case Studies: Exploitability Assessment for Vulnerabilities 


We show here how BINSEC/RSE (unless otherwise specified, the RSE+ variant)
can help in vulnerability assessment. In particular, we demonstrate that robust
reachability allows deeper insights into a bug than standard reachability, by
replaying 4 existing vulnerabilities.


CVE-2019-15900 in doas. doas is a utility granting higher privileges to users 
specified in a configuration file. User IDs are sometimes parsed incorrectly and 
left uninitialized. We look for a vulnerable configuration file denying root access 
to the attacker such that the (flawed) executable reliably grants root access to 
the attacker. For simplicity we assume that the system has no named users and 
groups and the configuration file has two lines. 

BINSEC/RSE with standard reachability reports that root access is granted
with a configuration file containing permit :("@@@@@ when the initial memory
address 0xffefffff contains the group ID of the attacker and the stack starts
at 0xfff0001f. This is a typical “false positive in practice”: these conditions
may vary unpredictably across executions so we cannot conclude regarding the
exploitability of the flaw.

With robust reachability where the configuration file is controlled but the 
initial state of memory is not, BINSEC/RSE reports in less than 10s that root 
access is granted reliably to the attacker when the configuration file contains deny 
:4 and permit b%@)@@(. This is more useful, but bZ@) @@( We test therefore if 
any other given user name is also affected by running the analysis with this user 
name concretized in the initial state. By this method, we proved that the flaw 
is also robustly reachable for wwww, a possible typo of a usual user name, as well 
as all two-letter lowercase user names. 

In other words, if the system administrator grants privileges to a non existing 
user by mistake, he may unknowingly grant them to the attacker instead. Here, 
robust reachability provides us with invaluable insight about the severity of a bug 
where standard reachability fails. 


CVE-2019-20839 in libvncserver. An attacker-chosen null-terminated string
is copied by an unbounded strcpy into a 108-byte buffer, leading to a stack
buffer overflow. Exploitability is not guaranteed: null bytes cannot be copied,
the executable is protected by SSP, etc. Starting from the vulnerable function,
we ask whether it is possible to return to the address 0xdeadbeef, chosen
arbitrarily.

BINSEC/RSE reports that for standard reachability, the bug can be reached 
when: (1) the stack starts at Oxff££00000; (2) the initial value of the return 
address of the function is 0; (3) the gs segment starts at 0xf7f00000; (4) the
stack canary is 0x01010180; (5) neither system call in the function fails; (6) file 
descriptor 0 is free; (7) the input path has a specific value. The attacker cannot 
prepare such a state, so this is another false positive in practice. 

With robust reachability, when only the input buffer is controlled and not the 
stack canary, BINSEC/RSE fails to prove or disprove exploitability in 24h. How- 
ever, if we mark the canary as controlled, BINSEC/RSE finds an exploit in about 
15 min. This suggests the canary brings a real protection against exploitation. 


CVE-2019-14192 in U-boot. U-boot is an open-source boot-loader, popular 
for embedded boards. When booting over Network File System (NFS), U-boot 
does not validate the length field of some network packets. This length is
reduced by 16 and used as a size to be copied. If a malicious packet declares a
length of less than 16, the computation underflows and leads to a buffer
overflow.

We encode the situation as follows: the input network packet is controlled,
the IP address of the victim is constant, the NFS state machine is initialized
to expect the appropriate packet type and all other values are uncontrolled.
BINSEC/RSE with the RSE$_\forall$+ variant (RSE+ times out here) proves in about
2 min that a memory copy of more than 4 GB is robustly reachable, which is a
strong indication of the criticality of this denial-of-service vulnerability.


CVE-2019-19307 in Mongoose. Mongoose is an embedded networking 
library. When receiving large MQTT packets, the length of the parsed packet 
can be computed as 0. The parsing loop does not advance and is thus infinite. We 
look for network packets whose length is parsed as 0 but are accepted as valid. 
BINSEC/RSE proves in less than a second that such situations are robustly 
reachable when only the network packet is controlled, confirming exploitability. 


6.3 Experimental Evaluation 


Research Questions. We now seek to investigate in a more systematic way 
the following research questions: 


Table 3. The 46 reachability problems selected for our evaluation


Type               | Description                                 | Controlled variable
Real vulnerability | CVE-2019-14192 (U-boot)                     | Network packet
                   | CVE-2019-20839 (libvncserver)               | Socket path
                   | CVE-2019-19307 (mongoose)                   | Network packet
                   | CVE-2019-15900 (doas)                       | Configuration file
                   | CVE-2015-8370 (grub, simplified)            | Password entry
CTF                | Flare-on 2015 1 & 2                         | Text entry
                   | Nintendo Coding Game                        | Input to hash function to invert
                   | Manticore                                   | Text entry
Function inversion | musl (strptime, strverscmp, atoi, strtol)   | Preimage
                   | busybox (chmod mode and ip parsing)         |
                   | uclibc (fnmatch)                            |
                   | openssl (base64 decoding)                   |
Synthetic          | Motivating example of [25] and variants     | Coefficients to affine function
                   | Motivating example of [24, Figure 2.2]      | Text entry
SSP bypass         | See Sect. 2                                 | Overflowing buffer
ASLR bypass        | 2 examples                                  | Various
Undefined behavior | Overflow flag after 3-bit shl in x86        | None
Other              | Various                                     | Various


686 G. Girol et al. 


RQ1 Precision: What is the best algorithm for robust reachability in terms of
correctness and completeness?

RQ2 Gain associated with robustness: Is standard SE subject to false positives
and does robust reachability avoid them in practice?

RQ3 Path pruning: Does universal path pruning (Sect. 5.4) help explore fewer
paths than normal path pruning?

RQ4 Performance: What is the overhead of robust reachability?


Protocol. We base our analysis on a set of 46 reachability problems on binary
executables from various architectures (i686-windows-pc, i686-linux-gnu and
armv7-linux-gnu) presented in Table 3. The average trace length for reachable
problem instances is 809 instructions, with a maximum of 18k instructions.
The problems fall into two categories: real code and synthetic examples (e.g.
code designed to be analysed). For each executable, BINSEC/RSE determines if a
certain location is robustly reachable from a certain initial state. If this is
the case a model is output by BINSEC/RSE, and compared to a ground truth
obtained by manual analysis. Tests were run on Intel Xeon E-2176M(12)@4.4 GHz
and we use Z3 4.8.7. Results are classified as follows:


Correct BINSEC/RSE proves the expected result, i.e. it either reports a robust 
trigger or rightfully proves the absence of such a trigger; 

False positive a fragile trigger is reported; 

Inconclusive BINSEC/RSE reports no trigger but search was incomplete or the 
solver returned UNKNOWN at some point; 

Resource exhaustion timeout is an hour and memory usage is capped to 7 GB. 


Table 4. Comparison of standard and robust algorithms over our 46 test cases


                            | SE   | BMC   | RSE$_\forall$ | RSE$_\forall$+ | RSE   | RSE+  | RBMC
Correct                     | 30   | 22    | 30            | 34             | 37    | 44    | 32
False positive              | 16   | 14    |               |                |       |       |
Inconclusive                |      |       | 16            | 11             | 7     |       | 1
Resource exhaustion         |      | 10    |               | 1              | 2     | 2     | 13
Total time (s)              | 2725 | 36911 | 3947          | 4374           | 13590 | 11534 | 47784
... w/o resource exhaustion | 2725 | 911   | 3947          | 3589           | 6390  | 4334  | 984


Precision (RQ1). As expected, robust variants do not report any false
positives, and path merging increases completeness. RSE variants with universal
path pruning (RSE$_\forall$, RSE$_\forall$+) are less complete than those with
existential path pruning, but they are less prone to timeouts. This is the case
of CVE-2019-14192 in U-boot (Sect. 6.2), for example. RBMC suffers from path
explosion (time out) much more often than RSE variants. Overall, Robust SE with
path merging and existential path pruning is the most promising method among
those presented here, with 44/46 correct answers. RSE$_\forall$+ is less
complete but terminates more often.

Note that two interesting test cases in the “real” category of Table 3 need 
path merging to prove robust reachability: one where a pointer with uncontrolled 
alignment is passed to memcpy, and one where a branch depends on the result 
of IO. These situations are common programming idioms, demonstrating the 
importance of path merging. 




Gain Associated to Robustness (RQ2). We compare standard SE with 
RSE+, the most precise algorithm of RQ1. Standard reachability has about 30% 
false positives while robust reachability has none, at the cost of slightly more 
timeouts. 

There are no false positives in code in the “real” category, except in CVE 
replays. Our interpretation is that well-functioning programs are designed to 
behave the same regardless of the uncontrolled environment: concrete memory
layout, stack canaries, etc. Robust reachability becomes decisive on buggy
code, notably with undefined behavior. This is also illustrated by case studies 
(Sect. 6.2). 


Path Pruning (RQ3). We compare RSE$_\forall$, which features universal path
pruning, to RSE, which features usual path pruning. Comparison is limited to
test runs of more than a second which succeed with both methods. This is to
prevent comparing a run where BINSEC/RSE proves that the target is reachable and
stops, to a run where BINSEC/RSE does not find the target and explores the
whole program. RSE$_\forall$ explores 17% fewer paths and interprets 21% fewer
instructions than RSE. This comes at the price of more universally quantified
SMT queries: the average time per SMT query goes up by 25%. Overall the run time
of both methods is very close.

With path merging, the difference in paths explored disappears: RSE$_\forall$+
explores 1% fewer paths and instructions than RSE+. This is due to the fact
that for some tests, path merging “unlocks” some new paths. Overall,
RSE$_\forall$+ is 6% slower than RSE+ on successful, terminating tests.




Performance (RQ4). In this question, we compare the run time of robust
algorithms to SE. Comparison is done on the same basis as before, except that
we count timeouts. RSE+ is 74% slower than standard SE on geometric average.
This is mostly due to newly introduced time-outs (up to 260x slower) since
median slowdown is only 15%. RSE$_\forall$ is more consistently slower with
about 30% slowdown in both geomean and median. This is mainly explained by
increased solver time (universal path pruning queries). RSE$_\forall$+ is close
in median slowdown, but path merging introduces new timeouts and drives the
average slowdown up to 62%. RSE+ has a low overhead compared to standard SE,
except for a few time-outs (2/46).


6.4 Additional Considerations


We excluded interactive systems and quantitative approaches from our definition 
of robustness (Definition 3, Sect. 4.1) to keep automated proof methods tractable. 
We motivate this choice by experimentally showing that these alternatives yield 
significant overhead. Technical details are provided in Appendix A. 


Quantitative Reasoning and Model Counting. We could imagine refining
our definition of robust reachability, looking for some controlled input for which
the number of uncontrolled inputs that reach the intended target is maximal
(or above a certain threshold). Although we have already observed that
model counters do not directly solve this problem (Sect. 4.1), we can lower bound
its runtime cost by the cost of determining the number of uncontrolled inputs $x$
satisfying a path constraint for some given controlled input $a_0$. We experimentally
measured it with SearchMC [39] and SMTApproxMC [11], two of the few model
counters supporting the SMTlib2 format and the QF_BV theory. We compare
this to our “all-or-nothing” qualitative approach on our 4 CVE case studies: the
quantitative approach is here several orders of magnitude slower than our
qualitative method: SMTApproxMC always times out, while SearchMC is at least
400x slower.


Interactive Systems and Quantifier Alternations. We estimate the cost
of adding more quantifier alternations in order to deal with interactive systems
(Sect. 4.1), by modifying queries on the two of our case studies where interactive
input makes sense (libvncserver and doas, cf. Sect. 6.2). RSE+ in this setting
does not terminate within 24h, highlighting the fact that current SMT solvers
have a very hard time generating models for quantified formulas beyond $\exists\forall$. It
seems to be a fundamental issue, as none of Z3 [22], Boolector [41] and CVC4 [5]
is able to prove in less than 1h that $\forall z.\,\exists a.\ a \text{ XOR } 1 = z$ holds over 32-bit
bitvectors. 
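
For a human the formula above is easy, because the existential has an explicit
witness: XOR with 1 is its own inverse, so a = z XOR 1 works for every z. The
sketch below only spot-checks this witness on a few 32-bit values. It is an
illustration under that observation, not a description of how an SMT solver
proceeds, and the names `witness` and `holds_for` are hypothetical.

```python
MASK = 0xFFFFFFFF  # 32-bit bitvectors


def witness(z: int) -> int:
    """Candidate Skolem function for the existential: a = z XOR 1."""
    return (z ^ 1) & MASK


def holds_for(z: int) -> bool:
    """Check a XOR 1 = z for the chosen witness a."""
    a = witness(z)
    return ((a ^ 1) & MASK) == (z & MASK)


# Spot check on a few corner cases (not an exhaustive proof).
samples = [0, 1, 2, 0x7FFFFFFF, 0x80000000, MASK]
all_ok = all(holds_for(z) for z in samples)
```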


7 Related Work 


Broadly speaking, we are interested in defining a subclass of comparatively more 
interesting bugs amenable to automation. We review related prior attempts. 


Automatic Exploit Generation (AEG). These approaches seek to demon- 
strate the impact of a bug by automatically generating an exploit from 
it [1,10,36]. This is complementary to robustness, which focuses on replicabil- 
ity. Actually, both techniques could be advantageously combined, as a replicable 
exploit is clearly more threatening than a fragile one. Since current AEG methods
are based on symbolic methods, adapting them for robustness looks feasible.


Quantitative Reasoning and Model Counting. Several approaches rely on 
probabilities or counting to distinguish important issues from minor ones—for 
example (quantitative) probabilistic model checking [2,34] or quantitative infor- 
mation flow analysis [37]. Robust reachability could be refined in such a way. 
Yet, current quantitative approaches do not scale on software, as they often rely 
either on the finite-state hypothesis, or on model counting solvers [32], which are
only at their beginning (see Sects. 4.1 and 6.4). 


Flakiness. The opposition between flaky tests and sturdy tests [42, section 6.3] 
is close to that between robustly reachable bugs and normally reachable bugs. 
A test is flaky when it is reachable, but not robustly reachable under the parti- 
tion of inputs where controlled inputs are deterministic inputs and uncontrolled 
inputs are non-deterministic inputs. Flakiness is thus a particular case of (non-) 
robustness. In particular, our tool can help find non-flaky tests.


Fairness. Fairness assumptions in model checking [35] aim at discarding traces 
considered as unrealistic and avoiding false alarms from the user point of view. 
While the goal is rather similar to ours, the two techniques are very different: 
fairness assumptions typically require certain sets of states to be visited infinitely 
often along a trace, while robust reachability requires that a trace cannot be 
influenced by uncontrolled input w.r.t. a given reachability property.


Symbolic Execution and Quantifiers. Finally, while symbolic execution is 
commonly performed with quantifier-free constraints, a notable exception is 
higher-order test generation [28], where Godefroid proposes to rely on universally 
quantified uninterpreted functions ($\forall\exists$ queries) in order to soundly approximate
opaque code constructs. Higher-order test generation and robust reachability 
are complementary as they serve two different purposes: robust reachability can 
only be used in a modest way for opaque code constructs (finding controlled 
inputs for which their value does not matter), while higher-order test genera- 
tion is inadequate for robust reachability, as it would be as if the attacker could 
choose the controlled inputs knowing the uncontrolled ones. 


8 Conclusion 


We introduce the novel concept of robust reachability, which we argue is better
suited than standard reachability in several important scenarios for both security 
(e.g., criticality assessment, bug prioritization) and software engineering (e.g., 
replicable test suites). We formally define and study robust reachability, discuss 
how standard symbolic methods to prove reachability can be revisited to deal 
with the robust case, design and implement the first robust symbolic execution 
engine and demonstrate its abilities in criticality assessment over 4 CVEs. We 
believe robust reachability is an important sweet spot in terms of expressiveness 
and tractability. We hope this first step will pave the way to more refinements 
and applications of robust reachability. 


A Details on the Experiments Supporting Sect. 6.4 


We reuse the notations of the discussion in Sect. 4.1. 


Model Counting. For simplicity, consider single-path robust reachability of $\ell$
along a path with path constraint $pc(a, x)$. It is equivalent to $\exists a.\,\forall x.\ pc(a, x)$.
A more quantitative approach would be to consider $a_{\max}$ s.t. the ratio $r(a_{\max})$
of $x$ satisfying $pc(a_{\max}, x)$ is maximal. The larger $r(a_{\max})$, the more robustly
reachable $\ell$. We try to experimentally get an idea of the cost of computing
this. Determining $a_{\max}$ is an open problem, but we can lower bound the full
computation time by the time to compute $r(a_{\max})$ from $a_{\max}$. As the algorithms
below are randomized, we can measure the time to compute $r(a_0)$ for any $a_0$.
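
The difference between the all-or-nothing check and the ratio can be made
concrete on a toy example. The sketch below assumes 8-bit inputs and a made-up
path constraint `toy_pc` (both are assumptions for illustration) and uses
exhaustive enumeration, which is of course not what a model counter does on
real path constraints.

```python
def ratio(pc, a, bits=8):
    """Fraction r(a) of uncontrolled inputs x (bits-bit values) with pc(a, x)."""
    n = 1 << bits
    return sum(1 for x in range(n) if pc(a, x)) / n


def robustly_reachable(pc, bits=8):
    """All-or-nothing check: is there an a such that pc(a, x) holds for all x?"""
    return any(ratio(pc, a, bits) == 1.0 for a in range(1 << bits))


def best_controlled_input(pc, bits=8):
    """a_max maximising r(a), found by exhaustive search."""
    return max(range(1 << bits), key=lambda a: ratio(pc, a, bits))


def toy_pc(a, x):
    """Hypothetical path constraint: the 8-bit sum a + x wraps below a."""
    return ((a + x) & 0xFF) < a


a_max = best_controlled_input(toy_pc)
print(a_max, ratio(toy_pc, a_max), robustly_reachable(toy_pc))
```

On this toy constraint no controlled input works for every x, so the
qualitative answer is negative, while the ratio still distinguishes good
controlled inputs from bad ones.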

We collect the path constraint of the first path standardly reaching the target
in our 4 case studies of Sect. 6.2. We arbitrarily choose $a_0$ satisfying $\exists x.\ pc(a_0, x)$,
and compare the time to (dis)prove $\forall x.\ pc(a_0, x)$ with Z3 to the time to
approximate $r(a_0)$ with two of the few model counters supporting SMTlib2 input in
the QF_BV theory: SearchMC [39] (with tolerance $\epsilon = 0.8$ and confidence
$1 - \delta = 0.95$) and SMTApproxMC [11] (with tolerance $\epsilon = 0.8$ and 1
iteration). We found no tool supporting arrays, so arrays were blasted. As shown in
Table 5, the quantitative approach is orders of magnitude slower in all cases, and
especially in the one case where it is indeed significantly more precise than our
qualitative approach (u-boot).


Table 5. All-or-nothing (Z3) vs quantitative (SearchMC, SMTApproxMC) approaches:
runtime and lower bound on $r(a_0)$. Timeout (TO) is 2,400s.


              doas             libvncserver     u-boot           mongoose
              time    bound    time    bound    time    bound    time    bound
Z3            0.02s   0%       0.01s   0%       0.07s   0%       0.04s   100%
SearchMC      94s     10^-13   4.8s    10^-17   190.6s  25%      35.1s   59%
SMTApproxMC   TO      -        TO      -        TO      -        TO      -


Quantifier Alternations. We want to model a leak in ASLR in libvncserver 
(Sect. 6.2): the attacker knows about an address z and wants to use the bug 
to jump to z. The corresponding property is: for all values of z, there exists
an attacker input a such that for all other uncontrolled inputs x, control flow 
is diverted to z. This uses another universal quantifier, which we exclude in 
our definition of robust reachability to keep satisfiability queries tractable. We 
implemented this for libvncserver (additional quantification on the target jump 
address) and doas (additional quantification on the user and group ID of the 
attacker, and the typoed user name): RSE+ does not terminate within 24h. 
Abstract. Hyperproperties are properties of computational systems
that require more than one trace to evaluate, e.g., many information-flow 
security and concurrency requirements. Where a trace property defines 
a set of traces, a hyperproperty defines a set of sets of traces. The tem- 
poral logics HyperLTL and HyperCTL* have been proposed to express 
hyperproperties. However, their semantics are synchronous in the sense 
that all traces proceed at the same speed and are evaluated at the same 
position. This precludes the use of these logics to analyze systems whose 
traces can proceed at different speeds and allow that different traces take 
stuttering steps independently. To solve this problem in this paper, we 
propose an asynchronous variant of HyperLTL. On the negative side, 
we show that the model-checking problem for this variant is undecid- 
able. On the positive side, we identify a decidable fragment which covers 
a rich set of formulas with practical applications. We also propose two 
model-checking algorithms that reduce our problem to the HyperLTL 
model-checking problem in the synchronous semantics. 


1 Introduction 


Hyperproperties [8] extend the conventional notion of trace properties [1] from a 
set of traces to a set of sets of traces. In other words, a hyperproperty stipulates a 
system property and not the property of just individual traces. Many interesting 
requirements in computing systems are hyperproperties and cannot be expressed 
by trace properties. Examples include (1) a wide range of information-flow secu- 
rity policies such as noninterference [14] and observational determinism [28], 
(2) sensitivity and robustness requirements in cyber-physical systems [27], and
(3) consistency conditions such as linearizability in concurrent data structures [5]. 

HyperLTL [7] is a temporal logic for hyperproperties that enriches LTL with 
quantifiers allowing explicit and simultaneous quantification over multiple exe- 
cution traces. For example, the observational determinism security policy [28] 
stipulates that any two executions that start in two low-equivalent states (i.e., 
states whose publicly observable variables have the same values) should remain
in low-equivalent states. This property can be expressed in HyperLTL as the
following formula: $\varphi_{OD} = \forall\pi.\forall\pi'.\,(l_\pi \leftrightarrow l_{\pi'}) \rightarrow \Box(l_\pi \leftrightarrow l_{\pi'})$. However,
the semantics of HyperLTL (and other formal languages for hyperproperties)
is synchronous, meaning that they completely abstract away the notion of time
passage. In HyperLTL, all traces proceed at the same speed, as all temporal
operators move the position on all traces simultaneously. Consider the program
$P_1$ in Fig. 1, where input values 0 and 1 are possible for high-secret variable h.
This renders two possible traces shown in Fig. 4a that satisfy $\varphi_{OD}$.

The synchronous semantics of HyperLTL has a shortcoming which has prac- 
tical implications as well: formulas are not invariant under stuttering. Note that, 
contrary to LTL, disallowing the use of $\bigcirc$ does not make the formula invariant
under stuttering, as traces can still stutter independently. This limits the
scope of application of HyperLTL to only those settings where different traces
can be perfectly aligned. For example, consider program $P_2$ in Fig. 2, where
line $\ell_4$ in $P_1$ is refined to its intermediate code using a register that stores the
value $l + 1$ and then stores this value in memory location $l$ in lines $\ell_4$ and
$\ell_5$, respectively. Applying the synchronous semantics of HyperLTL results in
declaring a violation of $\varphi_{OD}$ in the second position. This, however, is not an
accurate interpretation of $\varphi_{OD}$ (assuming that an attacker only has access to the
memory footprint and not the CPU registers or a timing channel), as the two
traces are stutter equivalent with respect to the state of variable $l$. In fact, the
synchronous semantics of HyperLTL may incorrectly identify good programs as 
bad because it ignores the notion of relative time between traces. This prob- 
lem is generally amplified in Kripke structures where self-loops correspond to 
non-deterministic choices that model that the system may remain in a state for 
some arbitrary time. For instance, consider $K$ in Fig. 3 and HyperLTL formula
$\forall\pi.\forall\pi'.((b_\pi \leftrightarrow b_{\pi'}) \ \mathcal{U}\ \Box(a_\pi \leftrightarrow a_{\pi'}))$. Only pairs of traces that take the self-loop
the same number of times satisfy this formula. However, since the goal of employ- 
ing a self-loop is typically to make the duration of staying in a state irrelevant, 
this semantics is too restrictive. 
Fig. 4. Synchronous vs. asynchronous semantics for HyperLTL.


Besides HyperLTL, other logics have been proposed that allow trace
quantification, for example, $H_\mu$ [15], which extends the linear-time $\mu$-calculus [3] with
path quantifiers and indexed next operators. For $H_\mu$, the model-checking
problem is in general undecidable, but two fragments, the k-synchronous and
k-context-bounded fragments, have been identified for which model checking remains
decidable [15].

In this paper, we propose an asynchronous temporal logic for hyperproperties. 
Our main motivation is to be able to reason about execution traces according 
to the relative order of the sequences of actions in each trace but not about the 
duration of each action. Software is inherently asynchronous, and so is hardware 
in many cases if one abstracts the execution platform or many features of the 
execution platform like pipelines, caches, memory contention, etc. We call our 
temporal logic Asynchronous HyperLTL or in short, A-HLTL. The key addition is 
the notion of trajectory that controls the relative speed at which traces progress 
by choosing at each instant which traces move and which traces stutter. For
example, the trajectory shown in Fig. 4c for the two traces of the program in 
Fig. 2 allows the lower trace to stutter in the first position while the upper trace 
advances. On the contrary, in the third position, the upper trace stutters while 
the lower trace moves from the second to the third position. This trajectory 
enables identification of stutter equivalence of the two traces with respect to 
state variable $l$ and, hence, successful verification of observational determinism.
In order to reflect the notion of trajectories in our logic, we lift the syntax 
of HyperLTL by allowing a trajectory modality. This way, the corresponding 
formula for observational determinism in A-HLTL is the following: 


\[ \varphi_{OD} = \forall\pi.\forall\pi'.E.\,(l_\pi \leftrightarrow l_{\pi'}) \rightarrow \Box(l_\pi \leftrightarrow l_{\pi'}) \]


where E denotes the existence of a trajectory for temporal operator $\Box$. The
A-HLTL formula for the Kripke structure in Fig. 3 is $\forall\pi.\forall\pi'.E.((b_\pi \leftrightarrow b_{\pi'}) \ \mathcal{U}\ (a_\pi \leftrightarrow a_{\pi'}))$.
A-HLTL allows us to reason about relational properties between
two different systems that differ on timing, like for example, translation valida- 
tion [22], which relates executions of the target code with the source code with 
respect to a (trace or hyper) property. 
We show an encoding of the PCP problem into model-checking a formula of
the shape $\forall\pi.\forall\pi'.E.(\Box\psi_1(\pi, \pi') \wedge \Diamond\psi_2(\pi, \pi'))$, which implies that model-checking
A-HLTL is undecidable, even for the universal fragment. On the positive side,
we show two decidable fragments of A-HLTL. The first algorithm is based on
a stuttering construction in which we modify the Kripke structure to accept all
stuttering expansions of the original paths. This algorithm can handle the fragment
$\forall\pi_1 \ldots \pi_n.E.\psi$, where $\psi$ is a phase formula, a class of safety formulas that
appear in many hyperproperties and are the building block for expressing trace
equivalence. Our second algorithm uses an acceleration construction to convert 
a finite sequence of transitions that do not change phase, into a single tran- 
sition. This algorithm is able to handle formulas with arbitrary quantification 
but a simpler kind of phase formulas. A-HLTL is, thus, the first logic for hyper- 
properties that can express the major asynchronous hyperproperties of interest 
within decidable fragments. Moreover, A-HLTL is the first logic for asynchronous 
hyperproperties with a practical model checking algorithm. Both algorithms
internally use HyperLTL model-checking as a building block. However, the
reduction from A-HLTL model-checking into HyperLTL requires modifying both the
formula and the model in a highly non-trivial way, to encode the existence of
trajectories. The choice of using HyperLTL model-checking as a building block 
is based on the existence of tools, but it does not imply that asynchronous prop- 
erties of interest can be expressed in HyperLTL directly. 

We have evaluated the stuttering construction on two sets of case studies: a
range of compiler optimizations and an SPI bus protocol. In both case studies, 
we were able to prove system correctness using our reduction from A-HLTL to 
synchronous HyperLTL. 


Organization. The rest of the paper is structured as follows. Section 2
contains the preliminaries, and Sect. 3 introduces A-HLTL and presents examples
of properties expressible in A-HLTL. Section 4 describes the decidable
fragments and presents procedures for the model-checking problem. Section 5 shows
that the model-checking problem for general A-HLTL formulas is undecidable
and presents the lower-bound complexity. Experimental results are presented
in Sect. 6. Finally, Sect. 7 discusses the related work, while Sect. 8 concludes.
Detailed proofs appear in the longer version of this paper in [4]. 


2 Preliminaries 


Let AP be a set of atomic propositions and $\Sigma = 2^{AP}$ be the alphabet, where
we call each element of $\Sigma$ a letter. A trace is an infinite sequence $\sigma = a_0 a_1 \cdots$
of letters from $\Sigma$. We denote the set of all infinite traces by $\Sigma^\omega$. We use $\sigma(i)$
for $a_i$ and $\sigma^i$ for the suffix $a_i a_{i+1} \cdots$. A pointed trace is a pair $(\sigma, p)$, where
$p \in \mathbb{N}_0$ is a natural number (called the pointer). Pointed traces allow traversing
a trace by moving the pointer. Given a pointed trace $(\sigma, p)$ and $n \geq 0$, we use
$(\sigma, p) + n$ as a shorthand for $(\sigma, p + n)$. We denote the set of all pointed traces by
$PTR = \{(\sigma, p) \mid \sigma \in \Sigma^\omega \text{ and } p \in \mathbb{N}_0\}$.

Two pointed traces $(\sigma, p)$ and $(\sigma', p')$ are stuttering equivalent if there are two
infinite sequences of indices $p = i_0 < i_1 < \ldots$ and $p' = j_0 < j_1 < \ldots$ such that for all
$k \geq 0$ and for all $l \in [i_k, i_{k+1})$ and $l' \in [j_k, j_{k+1})$, $\sigma(l) = \sigma'(l')$. A pointed trace
$(\sigma', p')$ is a stuttering expansion of $(\sigma, p)$ if there is a sequence $p' = j_0 < j_1 < \ldots$
such that for all $k \geq 0$ and for all $l \in [j_k, j_{k+1})$, $\sigma(p + k) = \sigma'(l)$. We say that $\sigma$
is stuttering equivalent to $\sigma'$ if $(\sigma, 0)$ is stuttering equivalent to $(\sigma', 0)$, and that
$\sigma'$ is a stuttering expansion of $\sigma$ if $(\sigma', 0)$ is a stuttering expansion of $(\sigma, 0)$.
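
On finite trace prefixes, these two notions can be sketched as follows. This is
only a finite analogue of the definitions above (an assumption, since traces
are infinite): letters are modelled as frozensets of propositions and the
function names are hypothetical.

```python
from itertools import groupby


def runs(trace):
    """Maximal blocks of identical letters as (letter, length) pairs."""
    return [(letter, len(list(block))) for letter, block in groupby(trace)]


def stutter_equivalent(t1, t2):
    """Both prefixes show the same sequence of blocks, whatever their lengths."""
    return [a for a, _ in runs(t1)] == [b for b, _ in runs(t2)]


def is_stuttering_expansion(expanded, original):
    """Each letter of `original` is repeated one or more times in `expanded`."""
    r_exp, r_orig = runs(expanded), runs(original)
    return len(r_exp) == len(r_orig) and all(
        le == lo and ne >= no for (le, ne), (lo, no) in zip(r_exp, r_orig)
    )


a, b = frozenset({"l"}), frozenset()
print(stutter_equivalent([a, a, b], [a, b, b, b]))   # blocks: a then b in both
print(is_stuttering_expansion([a, a, b], [a, b]))
```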

A Kripke structure is a tuple $K = (S, S_{init}, \delta, L)$, where $S$ is a set of states,
$S_{init} \subseteq S$ is the set of initial states, $\delta \subseteq S \times S$ is a transition relation, and
$L : S \to \Sigma$ is a labeling function on the states of $K$. We require that for each
$s \in S$, there exists $s' \in S$ such that $(s, s') \in \delta$.

A path of a Kripke structure is an infinite sequence of states $s(0)s(1)\cdots \in S^\omega$,
such that $s(0) \in S_{init}$ and $(s(i), s(i+1)) \in \delta$, for all $i \geq 0$. A trace of a
Kripke structure is a trace $\sigma(0)\sigma(1)\sigma(2)\cdots \in \Sigma^\omega$, such that there exists a path
$s(0)s(1)\cdots \in S^\omega$ with $\sigma(i) = L(s(i))$ for all $i \geq 0$. Abusing notation, we use
$\sigma = L(\rho)$ to denote that $\sigma$ is the trace corresponding to path $\rho$. We denote by
$Traces(K, s)$ the set of all traces of $K$ with paths that start in state $s \in S$. We
denote by $Traces(K, A)$ the set of all traces that start from some state in $A \subseteq S$
and use $Traces(K)$ as a shorthand for $Traces(K, S_{init})$.
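
As a small illustration of these definitions, the following sketch represents a
Kripke structure with plain Python sets and enumerates finite prefixes of its
traces. The class, its method names, and the two-state example are hypothetical
and only meant to make the tuple $(S, S_{init}, \delta, L)$ concrete.

```python
from dataclasses import dataclass


@dataclass
class Kripke:
    states: set
    init: set
    delta: set    # set of (s, s') pairs
    label: dict   # state -> frozenset of atomic propositions

    def is_total(self):
        """Every state must have at least one successor."""
        return all(any(src == s for (src, _) in self.delta) for s in self.states)

    def path_prefixes(self, length):
        """All finite path prefixes of the given length from initial states."""
        paths = [(s,) for s in self.init]
        for _ in range(length - 1):
            paths = [p + (t,) for p in paths for (s, t) in self.delta if s == p[-1]]
        return paths

    def trace_prefixes(self, length):
        """Label sequences of the path prefixes."""
        return {tuple(self.label[s] for s in p) for p in self.path_prefixes(length)}


# Hypothetical example: s0 (labelled b) may loop before moving to s1 (labelled a).
K = Kripke(
    states={"s0", "s1"},
    init={"s0"},
    delta={("s0", "s0"), ("s0", "s1"), ("s1", "s1")},
    label={"s0": frozenset({"b"}), "s1": frozenset({"a"})},
)
```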


HyperLTL. HyperLTL [7] is a temporal logic that extends LTL [19,21] for 
hyperproperties, which allows reasoning about multiple execution traces simul- 
taneously. The syntax of HyperLTL is: 


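\[
\begin{aligned}
\varphi &::= \exists\pi.\varphi \mid \forall\pi.\varphi \mid \psi \\
\psi &::= a_\pi \mid \psi \vee \psi \mid \neg\psi \mid \bigcirc\psi \mid \psi \ \mathcal{U}\ \psi
\end{aligned}
\]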


where $\pi$ is a trace variable from an infinite supply of trace variables. The intended
meaning of $a_\pi$ is that proposition $a \in AP$ holds at the current time in trace
$\pi$. Trace quantifiers $\exists\pi$ and $\forall\pi$ allow reasoning simultaneously about different
traces of the computation. Atomic predicates $a_\pi$ refer to a single trace $\pi$. Given
a HyperLTL formula $\varphi$, we use $Vars(\varphi)$ for the set of trace variables quantified
in $\varphi$. A formula $\varphi$ is well-formed if for all atoms $a_\pi$ in $\varphi$, $\pi$ is quantified in $\varphi$
(i.e., $\pi \in Vars(\varphi)$) and if no trace variable is quantified twice in $\varphi$. Given a set
of traces $T$, the semantics of a HyperLTL formula $\varphi$ is defined in terms of trace
assignments, which are (partial) maps from trace variables to pointed traces
$\Pi : Vars(\varphi) \to PTR$. The trace assignment with empty domain is denoted by
$\Pi_\emptyset$. We use $Dom(\Pi)$ for the subset of $Vars(\varphi)$ for which $\Pi$ is defined. Given a
trace assignment $\Pi$, a trace variable $\pi$, a trace $\sigma$ and a pointer $p$, we denote
by $\Pi[\pi \to (\sigma, p)]$ the assignment that coincides with $\Pi$ for every trace variable
except for $\pi$, which is mapped to $(\sigma, p)$. Also, we use $\Pi + n$ to denote the trace
assignment $\Pi'$ such that $\Pi'(\pi) = \Pi(\pi) + n$ for all $\pi \in Dom(\Pi) = Dom(\Pi')$.
The semantics of HyperLTL is:


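\[
\begin{array}{lll}
\Pi \models_T \exists\pi.\varphi & \text{iff} & \text{for some } \sigma \in T,\ \Pi[\pi \to (\sigma, 0)] \models_T \varphi \\
\Pi \models_T \forall\pi.\varphi & \text{iff} & \text{for all } \sigma \in T,\ \Pi[\pi \to (\sigma, 0)] \models_T \varphi \\
\Pi \models_T \psi & \text{iff} & \Pi \models \psi \\
\Pi \models a_\pi & \text{iff} & a \in \sigma(p), \text{ where } (\sigma, p) = \Pi(\pi) \\
\Pi \models_T \psi_1 \vee \psi_2 & \text{iff} & \Pi \models_T \psi_1 \text{ or } \Pi \models_T \psi_2 \\
\Pi \models \neg\psi & \text{iff} & \Pi \not\models \psi \\
\Pi \models \bigcirc\psi & \text{iff} & (\Pi + 1) \models \psi \\
\Pi \models \psi_1 \ \mathcal{U}\ \psi_2 & \text{iff} & \text{for some } j \geq 0,\ (\Pi + j) \models \psi_2 \\
 & & \text{and for all } 0 \leq i < j,\ (\Pi + i) \models \psi_1
\end{array}
\]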


Note that quantifiers assign traces to trace variables and set the pointer to the 
initial position 0. We say that a set of traces $T$ is a model of a HyperLTL formula
$\varphi$, denoted $T \models \varphi$, whenever $\Pi_\emptyset \models_T \varphi$. A Kripke structure $K$ is a model of a
HyperLTL formula $\varphi$, denoted by $K \models \varphi$, whenever $Traces(K) \models \varphi$.
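
The synchronous semantics above can be sketched as a small evaluator. The
sketch makes several assumptions that are not part of the definitions: traces
are ultimately periodic ("lassos"), formulas are nested tuples in prenex form,
and the bound used for U relies on the joint behaviour of lassos repeating
after the longest prefix plus one common period. All names are hypothetical,
and `math.lcm` with several arguments requires Python 3.9 or later.

```python
from math import lcm


class Lasso:
    """Ultimately periodic trace: `prefix` followed by `loop` repeated forever."""

    def __init__(self, prefix, loop):
        self.prefix, self.loop = list(prefix), list(loop)

    def at(self, p):
        if p < len(self.prefix):
            return self.prefix[p]
        return self.loop[(p - len(self.prefix)) % len(self.loop)]


def holds(phi, traces, assign=None, pos=0):
    """Evaluate a prenex formula; all pointers move together (position `pos`)."""
    assign = assign or {}
    op = phi[0]
    if op in ("exists", "forall"):
        _, pi, body = phi
        results = [holds(body, traces, {**assign, pi: t}, 0) for t in traces]
        return any(results) if op == "exists" else all(results)
    if op == "true":
        return True
    if op == "ap":
        _, a, pi = phi
        return a in assign[pi].at(pos)
    if op == "not":
        return not holds(phi[1], traces, assign, pos)
    if op == "or":
        return holds(phi[1], traces, assign, pos) or holds(phi[2], traces, assign, pos)
    if op == "next":
        return holds(phi[1], traces, assign, pos + 1)
    if op == "until":
        # Checking one longest prefix plus one common period is enough for lassos.
        horizon = max((len(t.prefix) for t in assign.values()), default=0) + lcm(
            *(len(t.loop) for t in assign.values())
        )
        for j in range(pos, pos + horizon):
            if holds(phi[2], traces, assign, j):
                return True
            if not holds(phi[1], traces, assign, j):
                return False
        return False
    raise ValueError(f"unknown operator {op!r}")


def always(f):   # box f, encoded as not (true U not f)
    return ("not", ("until", ("true",), ("not", f)))


def iff(f, g):   # f <-> g using only "or" and "not"
    both = ("not", ("or", ("not", f), ("not", g)))
    neither = ("not", ("or", f, g))
    return ("or", both, neither)


# forall pi. forall pi'. box (l_pi <-> l_pi')
ALWAYS_SAME_L = ("forall", "pi", ("forall", "pi2",
                 always(iff(("ap", "l", "pi"), ("ap", "l", "pi2")))))

t1 = Lasso([{"l"}], [set()])          # l holds for one step
t2 = Lasso([{"l"}, {"l"}], [set()])   # same behaviour with one extra stutter
```

Because both pointers advance together, `t1` and `t2` are told apart at
position 1 even though they differ only by a stuttering step.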


3 Asynchronous HyperLTL 


We introduce a temporal logic A-HLTL as an extension of HyperLTL to express 
asynchronous hyperproperties. 


Trajectories. To model the asynchronous passage of time, we now introduce the
notion of a trajectory, which chooses when traces move and when they stutter.
Let $V$ be a set of trace variables and let $I \subseteq V$. The $I$-successor of a trace
assignment $\Pi$, denoted by $\Pi + I$, is the trace assignment $\Pi'$ such that
$\Pi'(\pi) = \Pi(\pi) + 1$ if $\pi \in I$ and $\Pi'(\pi) = \Pi(\pi)$ otherwise. That is, the pointers of indices
in $I$ advance by one step, while the others remain the same. A trajectory
$t : t(0)t(1)t(2)\cdots$ for a formula $\varphi$ is an infinite sequence of non-empty subsets of
$Vars(\varphi)$. Essentially, in each step of the trajectory one or more of the traces
make progress. A trajectory is fair for a trace variable $\pi \in Vars(\varphi)$ if there are
infinitely many positions $j$ such that $\pi \in t(j)$. A trajectory is fair if it is fair
for all trace variables in $Vars(\varphi)$. Given a trajectory $t$, by $t^i$ we mean the suffix
$t(i)t(i+1)\cdots$. Furthermore, for a set of trace variables $V$, we use $TRJ_V$ for the set
of all trajectories for indices from $V$.
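
A minimal sketch of how a trajectory drives a trace assignment is given below.
It assumes that an assignment is a dictionary from trace variables to
(trace, pointer) pairs and that an infinite trajectory is given as a finite
prefix followed by a loop that repeats forever. These representations and the
function names are illustrative assumptions.

```python
def successor(assignment, moving):
    """I-successor: advance the pointer of every trace variable in `moving`."""
    return {
        pi: (trace, p + 1 if pi in moving else p)
        for pi, (trace, p) in assignment.items()
    }


def run(assignment, steps):
    """Apply a finite sequence of trajectory steps, one after the other."""
    for step in steps:
        if not step:
            raise ValueError("each trajectory step must move at least one trace")
        assignment = successor(assignment, step)
    return assignment


def is_fair(loop, variables):
    """prefix . loop^omega is fair iff every variable moves somewhere in the loop."""
    return all(any(pi in step for step in loop) for pi in variables)


start = {"pi": ("sigma", 0), "pi2": ("sigma2", 0)}
end = run(start, [{"pi"}, {"pi", "pi2"}, {"pi2"}])
```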


3.1 Syntax and Semantics of Asynchronous HyperLTL 
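The syntax of Asynchronous HyperLTL is:


\[
\begin{aligned}
\varphi &::= \exists\pi.\varphi \mid \forall\pi.\varphi \mid E\psi \mid A\psi \\
\psi &::= a_\pi \mid \neg\psi \mid \psi \vee \psi \mid \psi \ \mathcal{U}\ \psi \mid \bigcirc\psi
\end{aligned}
\]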


where $a \in AP$, $\pi$ is a trace variable from an infinite supply $V$ of trace variables, E
is the existential trajectory modality and A is the universal trajectory modality.
The intended meaning of E is that there is a trajectory that gives an
interpretation of the relative passage of time between the traces for which the temporal
formula that relates the traces is satisfied. Dually, A means that for all
trajectories, the resulting alignment makes the inner formula true. It is important
to note that there is no nesting of trajectory modalities and that all temporal
operators in a formula are interpreted with respect to a single modality.


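We use the usual syntactic sugar for Boolean operators $true \stackrel{def}{=} a_\pi \vee \neg a_\pi$,
$false \stackrel{def}{=} \neg true$, $\psi_1 \wedge \psi_2 \stackrel{def}{=} \neg(\neg\psi_1 \vee \neg\psi_2)$, and the syntactic sugar for temporal
operators $\Diamond\psi \stackrel{def}{=} true \ \mathcal{U}\ \psi$, $\psi_1 \rightarrow \psi_2 \stackrel{def}{=} \neg\psi_1 \vee \psi_2$, and $\Box\psi \stackrel{def}{=} \neg\Diamond\neg\psi$, etc.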

As before, we use trace assignments for the semantics of A-HLTL. Given
$(\Pi, t)$ where $\Pi$ is a trace assignment and $t$ a trajectory, we use $(\Pi, t) + 1$ for
the successor of $(\Pi, t)$ defined as $(\Pi', t')$ where $t' = t^1$, and $\Pi'(\pi) = \Pi(\pi) + 1$ if
$\pi \in t(0)$ and $\Pi'(\pi) = \Pi(\pi)$ otherwise. We use $(\Pi, t) + k$ as the $k$-th successor
of $(\Pi, t)$.

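The satisfaction of an asynchronous HyperLTL formula $\varphi$ over a trace
assignment $\Pi$ and a set of traces $T$, denoted by $\Pi \models_T \varphi$, is defined as follows:


\[
\begin{array}{lll}
\Pi \models_T \exists\pi.\varphi & \text{iff} & \text{for some } \sigma \in T : \Pi[\pi \to (\sigma, 0)] \models_T \varphi \\
\Pi \models_T \forall\pi.\varphi & \text{iff} & \text{for all } \sigma \in T : \Pi[\pi \to (\sigma, 0)] \models_T \varphi \\
\Pi \models_T E\psi & \text{iff} & \text{for some } t \in TRJ_{Dom(\Pi)} : (\Pi, t) \models \psi \\
\Pi \models_T A\psi & \text{iff} & \text{for all } t \in TRJ_{Dom(\Pi)} : (\Pi, t) \models \psi \\
(\Pi, t) \models a_\pi & \text{iff} & a \in \Pi(\pi)(0) \\
(\Pi, t) \models \neg\psi & \text{iff} & (\Pi, t) \not\models \psi \\
(\Pi, t) \models \psi_1 \vee \psi_2 & \text{iff} & (\Pi, t) \models \psi_1 \text{ or } (\Pi, t) \models \psi_2 \\
(\Pi, t) \models \bigcirc\psi & \text{iff} & (\Pi, t) + 1 \models \psi \\
(\Pi, t) \models \psi_1 \ \mathcal{U}\ \psi_2 & \text{iff} & \text{for some } i \geq 0 : (\Pi, t) + i \models \psi_2 \text{ and} \\
 & & \text{for all } j < i : (\Pi, t) + j \models \psi_1
\end{array}
\]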


We say that a set $T$ of traces satisfies a closed sentence $\varphi$, denoted by $T \models \varphi$,
if $\Pi_\emptyset \models_T \varphi$. We say that a Kripke structure $K$ satisfies an A-HLTL formula $\varphi$
(and write $K \models \varphi$) if and only if we have $Traces(K, S_{init}) \models \varphi$.
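
To give an intuition for the existential trajectory modality, the sketch below
searches for a trajectory under which two traces agree on a set of propositions
at every step. It rests on simplifying assumptions that are not part of the
semantics above: traces are finite lists whose last letter is taken to repeat
forever, and only trajectories in which both traces eventually reach their
final letter (fair trajectories) are considered. The function name is
hypothetical.

```python
from collections import deque


def exists_aligning_trajectory(s1, s2, props):
    """Breadth-first search over pointer pairs (i, j) for a valid trajectory."""

    def same(i, j):
        return (s1[i] & props) == (s2[j] & props)

    goal = (len(s1) - 1, len(s2) - 1)   # both traces sit on their final letter
    if not same(0, 0):
        return False
    seen, queue = {(0, 0)}, deque([(0, 0)])
    while queue:
        i, j = queue.popleft()
        if (i, j) == goal:
            return True
        # A trajectory step moves the first trace, the second trace, or both.
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            ni, nj = min(i + di, goal[0]), min(j + dj, goal[1])
            if (ni, nj) not in seen and same(ni, nj):
                seen.add((ni, nj))
                queue.append((ni, nj))
    return False


L_, N_ = frozenset({"l"}), frozenset()
upper = [N_, L_, L_]    # l is set after one step
lower = [N_, N_, L_]    # l is set one step later
synchronous_ok = all(x == y for x, y in zip(upper, lower))
asynchronous_ok = exists_aligning_trajectory(upper, lower, frozenset({"l"}))
```

In this example the synchronous comparison fails at the second position,
whereas a trajectory that lets the upper trace wait one step aligns the two
traces.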


3.2 Examples of A-HLTL 


We illustrate the expressive power of A-HLTL by introducing the asynchronous 
version of well-known properties. 


Linearizability [16] requires that any history of execution of a concurrent data
structure (i.e., sequence of invocation and response by different threads) matches
some sequential order of invocations and responses:


\[ \varphi_{LNZ} \stackrel{def}{=} \forall\pi.\exists\pi'.E.\Box(history_\pi \leftrightarrow history_{\pi'}) \]


where history denotes method invocations (and not the actual execution of the
internal instructions of the concurrent library) by the different threads and the
response observed, trace $\pi$ ranges over the concurrent data structure and $\pi'$
ranges over its sequential counterpart.
Goguen and Meseguer’s Noninterference (GMNI) [14] stipulates that, for all
traces, the low-observable output must not change when all high inputs are
removed:


\[ \varphi_{GMNI} \stackrel{def}{=} \forall\pi.\exists\pi'.E.(\Box\lambda_{\pi'}) \wedge \Box(lo_\pi \leftrightarrow lo_{\pi'}) \]


where $\lambda_{\pi'}$ expresses that all of the high inputs in the current state of $\pi'$ have
dummy value $\lambda$, and $lo$ denotes the low-observable output proposition.


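Not Never Terminates [18] requires that for every initial state, there is a
terminating trace and a non-terminating trace:


\[ \varphi_{NNT} \stackrel{def}{=} \forall\pi.\exists\pi'.\exists\pi''.E.(\pi[0] = \pi'[0] = \pi''[0]) \rightarrow (\Diamond term_{\pi'} \wedge \Box\neg term_{\pi''}) \]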


Termination-Insensitive Noninterference [25] requires that for two executions
that start from low-equivalent states, information leaks are permitted if they
are transmitted purely by the program’s termination behavior. That is, the
program may diverge on some high inputs and terminate on others:


\[ \varphi_{TIN} \stackrel{def}{=} \forall\pi.\forall\pi'.E.(l_\pi \leftrightarrow l_{\pi'}) \rightarrow \Big( (\Box\neg term_\pi \vee \Box\neg term_{\pi'}) \vee \Diamond(term_\pi \wedge term_{\pi'} \wedge (l_\pi \leftrightarrow l_{\pi'})) \Big) \]


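Termination-Sensitive Noninterference [2] is the same as the termination-insensitive
variant, except that it forbids one trace from diverging while the other
terminates:


\[ \varphi_{TSN} \stackrel{def}{=} \forall\pi.\forall\pi'.E.(l_\pi \leftrightarrow l_{\pi'}) \rightarrow \Big( (\Box\neg term_\pi \wedge \Box\neg term_{\pi'}) \vee \Diamond(term_\pi \wedge term_{\pi'} \wedge (l_\pi \leftrightarrow l_{\pi'})) \Big) \]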


4 Model-Checking A-HLTL 


In this section, we show the decidability of the model-checking problem for two 
classes of A-HLTL formulas using two different algorithms: 


(1) a stuttering construction in which we modify the Kripke structure K to 
accept all stuttering expansions of paths in K; and 

(2) an acceleration construction in which the modified Kripke structure accel- 
erates jumping directly to the synchronization points. 


In both cases the problem is reduced to model-checking HyperLTL formulas, 
which is known to be decidable [7,12]. We describe each construction separately. 
4.1 The Stuttering Construction


We consider first A-HLTL formulas of the form $\forall\pi_1 \ldots \pi_n.E.\psi$. We will then
extend our results to the $\exists^*$ fragment, to handle the A trajectory modality and to
a larger collection of predicates. The class of temporal formulas $\psi$ that we handle
are called admissible formulas, and are defined as the Boolean combination of:


1. any number of state formulas, which may relate propositions $p_{\pi_i}$ of different
traces arbitrarily;

2. any number of temporal formulas (called monadic temporal formulas), each of
which only uses one trace variable and is invariant under stuttering
(guaranteed for example by forbidding the use of $\bigcirc$); and

3. one phase formula, which is an invariant that can relate different traces in a
restricted way (see below).


Given an admissible formula $\psi$, we use $\psi_{ph}$ for its phase formula, and we use
$\psi[\psi_{ph} \leftarrow \xi]$ for the formula that results from $\psi$ by replacing $\psi_{ph}$ with $\xi$. Since $\psi_{ph}$
occurs only once in $\psi$, we use the fact that $\psi_{ph}$ appears with a single polarity.
We present here the construction for positive polarity, which is the case in all
practical formulas (the case for negative polarity is analogous).

The algorithm has two parts. First, we generate the stuttering Kripke
structure $K^{st}$ whose paths are the stuttering expansions of paths in the original Kripke
structure $K$. Then, we modify the admissible formula $\psi$ into $\psi_{sync}$ such that
$K \models \forall\pi_1 \ldots \pi_n.E.\psi$ if and only if $K^{st} \models \forall\pi_1 \ldots \pi_n.\psi_{sync}$. We describe each of the
concepts separately.


Phase Formulas. We first define atomic phase formulas $\Box \bigwedge_{p \in P} (p_{\pi_i} \leftrightarrow p_{\pi_j})$, which
are characterized by $(\pi_i, \pi_j, P)$, where $P \subseteq AP$ and $\pi_i$ and $\pi_j$ are two different
trace variables. We use color to refer to a valuation of the variables in $P$.
Essentially, an atomic phase formula asserts that all propositions in $P$ coincide in
both traces at all points in time, that is, both traces exhibit the same sequence
of colors. Since the passage of time proceeds at different speeds in the different
traces (according to the trajectory), atomic phase formulas state that the traces for
$\pi_i$ and $\pi_j$ are sequences of phases of the same color, where corresponding phases
may have different lengths. A phase formula is formed from atomic formulas as
follows:


\[ \Box\Big( \bigwedge_{p \in P^1} (p_{\pi_i^1} \leftrightarrow p_{\pi_j^1}) \;\wedge\; \cdots \;\wedge\; \bigwedge_{p \in P^k} (p_{\pi_i^k} \leftrightarrow p_{\pi_j^k}) \Big) \]


We use $\mathcal{P} : \{(\pi_i^1, \pi_j^1, P^1), \ldots, (\pi_i^k, \pi_j^k, P^k)\}$ for the collection of predicates and
trace variables that characterize a phase formula.


Stuttering Kripke Structure. We start from $K$ and create $K^{st}$ that accepts the
stuttering expansions of traces in $K$. First, the alphabet of atomic propositions is
enriched with a fresh proposition $st$, that is $AP^{st} = AP \cup \{st\}$, to encode whether
the state represents a real move or a stuttering move. Given $K = (S, S_{init}, \delta, L)$,
the stuttering Kripke structure is $K^{st} = (S^{st}, S_{init}, \delta^{st}, L^{st})$ where:

- $S^{st} = S \cup \{s^{st} \mid s \in S\}$ contains two copies of each state in $S$, where we use
  $s^{st}$ to denote the stuttering state that corresponds to $s$;

- $\delta^{st} = \delta \cup \{(s, s^{st})\} \cup \{(s^{st}, s^{st})\} \cup \{(s^{st}, s') \mid \text{for every } (s, s') \in \delta\}$;

- $L^{st}(s) = L(s)$ for $s \in S$, and $L^{st}(s^{st}) = L(s) \cup \{st\}$.


The construction generates a Kripke structure $K^{st}$ which is linear in the size of
the original Kripke structure $K$. It is easy to see that every stuttering expansion
of a path of $K$ has a corresponding path in $K^{st}$, where the repeated version of
state $s$ is captured by state $s^{st}$. Conversely, every path $\rho'$ in $K^{st}$ whose trace
satisfies $\Box\Diamond\neg st$ can be turned into its “stuttering compression” $\rho$ by removing
all stuttering states, which is a path of $K$. Note that the constraint $\Box\Diamond\neg st$
guarantees that there are infinitely many non-stuttering positions in $\rho'$, so $\rho$
is well-defined. Hence, this construction provides a one-to-one correspondence
between a trajectory together with a tuple of traces of $K$, and the corresponding
tuple of traces of $K^{st}$.
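
The construction of $K^{st}$ can be sketched directly from the three items
above. The sketch assumes that states are hashable Python values, that the
stuttering copy of a state s is encoded as the pair (s, "st"), and that "st" is
not already a proposition of $K$. These encodings and the function name are
illustrative choices, not part of the construction itself.

```python
def stuttering_kripke(states, init, delta, label):
    """Build (S^st, S_init, delta^st, L^st) from (S, S_init, delta, L)."""

    def st(s):
        return (s, "st")   # hypothetical encoding of the stuttering copy of s

    states_st = set(states) | {st(s) for s in states}
    delta_st = set(delta)
    delta_st |= {(s, st(s)) for s in states}        # start stuttering in s
    delta_st |= {(st(s), st(s)) for s in states}    # keep stuttering
    delta_st |= {(st(s), t) for (s, t) in delta}    # resume with a real move
    label_st = {s: frozenset(label[s]) for s in states}
    label_st.update({st(s): frozenset(label[s]) | {"st"} for s in states})
    return states_st, set(init), delta_st, label_st


# Hypothetical two-state example.
S, S0 = {0, 1}, {0}
D = {(0, 1), (1, 1)}
Lab = {0: {"b"}, 1: {"a"}}
S_st, S0_st, D_st, Lab_st = stuttering_kripke(S, S0, D, Lab)
```

The result has exactly twice as many states as the input, and every new
transition is derived from a single state or a single transition of $K$, which
matches the linear size noted above.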


State and Monadic Formulas are not Affected by Trajectories. State formulas 
are relational formulas that are evaluated at the beginning of the computation. 
Temporal monadic formulas only refer to one trace variable and are stuttering 
invariant by definition. Therefore, none of these formulas are affected by the 
stuttering induced by a trajectory, as the relative stuttering among traces does 
not affect their truth valuation. We first note that, given a trace assigned to each
of the trace variables in $Vars(\psi)$, the truth value of state formulas and monadic
formulas does not depend on the trajectory chosen.


Phase Alignment of Asynchronous Sequences. We use the stuttering in $K^{st}$ to
encode the relative progress of traces as dictated by a trajectory. We will now
introduce synchronous HyperLTL formulas to reason in $K^{st}$ about the corresponding
states during the asynchronous evaluation in $K$. The important concept
is that of “phase changes”, which are the points in a trace $\sigma$ at which the
valuation of the predicates $P$ in an atomic phase formula $(\pi_i, \pi_j, P)$ changes. Let
$\Pi$ be a trace assignment for traces in $K$ that maps $\pi_i$ to a pointed trace $(\sigma, l)$.
We say that in assignment $\Pi$, trace variable $\pi_i$ is about to change phase with
respect to $(\pi_i, \pi_j, P)$ if for some $p \in P$ either $p \in \sigma(l)$ but $p \notin \sigma(l+1)$, or $p \notin \sigma(l)$
but $p \in \sigma(l+1)$. Note that in $K^{st}$ the next relevant letter (the one corresponding
to $\sigma(l+1)$) is the first letter that is not a stuttering letter. Formula $change_P(\pi_i)$
captures that the next non-stuttering step of $\pi_i$ is a phase change (with respect
to the predicates in $P$ and therefore with respect to the atomic phase formula):


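$$change_P(\pi_i) \;\stackrel{def}{=}\; \bigvee_{p \in P} \big(p_{\pi_i} \leftrightarrow \bigcirc(st_{\pi_i} \,\mathcal{U}\, \neg p_{\pi_i})\big)$$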


A first observation is that a phase change for $\pi_i$ in atomic phase formula $(\pi_i, \pi_j, P)$
implies that $\pi_j$ must also proceed to change phase. The second observation is that
when $\pi_i$ and $\pi_j$ are not changing phases, any choice that the trajectory makes will
preserve the valuation of the atomic phase formula.

We now formally capture this intuition as formulas. Predicate $move(\pi_i) \stackrel{def}{=} \bigcirc(\neg st_{\pi_i})$
indicates whether trace variable $\pi_i$ will move (and not stutter) at a
given instant of the computation. The following temporal formula captures the
consistency criteria of phase changes as a synchronized decision for moving traces
$\pi_i$ and $\pi_j$ related by an atomic phase formula $(\pi_i, \pi_j, P)$:


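$$
\begin{array}{l}
align(\pi_i, \pi_j, P) \;\stackrel{def}{=} \\
\quad \big((move(\pi_i) \wedge move(\pi_j)) \rightarrow (change_P(\pi_i) \leftrightarrow change_P(\pi_j))\big) \;\wedge \\
\quad \big((move(\pi_i) \wedge \neg move(\pi_j)) \rightarrow \neg change_P(\pi_i)\big) \;\wedge \\
\quad \big((\neg move(\pi_i) \wedge move(\pi_j)) \rightarrow \neg change_P(\pi_j)\big)
\end{array}
$$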


We will reduce the model-checking problem in A-HLTL to checking in $K^{st}$ that
tuples of traces that align phase changes (for all atomic phase formulas) satisfy
all sub-formulas of the specification $\psi$. The following two formulas express that
all atomic phase formulas align, and that all traces are fair (all traces eventually
move):

$$phase \;\stackrel{def}{=} \bigwedge_{(\pi_i, \pi_j, P) \in \mathcal{P}} align(\pi_i, \pi_j, P) \qquad\qquad fair \;\stackrel{def}{=} \bigwedge_{\pi_i \in \{\pi_1 \ldots \pi_n\}} \Box\Diamond\neg st_{\pi_i}$$

We will then check in $K^{st}$ that all stuttering traces that align phases and are fair
satisfy the desired formula $\psi$, that is, $(\Box phase \wedge fair) \rightarrow \psi$. Note that all those
tuples of traces that do not align phases are ruled out in the antecedent.

A final technical detail in the construction is that we must guarantee that
for all tuples of paths of $K$ there are stuttering expansions that are fair and
align phases, and that they have the same number of phases. Otherwise, there
are paths of $K$ that cannot be aligned, which inevitably leads to a violation
of $\psi_{ph}$. It could be the case that some tuple of traces of $K$ cannot possibly
align the phase changes corresponding to all atomic phase formulas. This can
happen in two cases: (1) when two traces have different numbers of phases, and
(2) when there is a circular dependency between the atomic formulas that forces
the trajectory to synchronize the traces in incompatible orders. The first case is
captured by:


$$missalign \;\stackrel{def}{=} \bigvee_{(\pi_i, \pi_j, P)} \big(\Box\neg change_P(\pi_i)\big) \not\leftrightarrow \big(\Box\neg change_P(\pi_j)\big)$$


The second case is captured by the following formula, where $cycles(\psi_{ph})$
are the sequences of atomic formulas that form a simple cycle, that is,
$[(\pi^0, \pi^1, P^0), (\pi^1, \pi^2, P^1), \ldots, (\pi^k, \pi^0, P^k)]$ such that the second trace variable is
the first trace variable of the next atomic phase formula, circularly (see Ex. 1
below):

$$block \;\stackrel{def}{=} \bigvee_{C \in cycles(\psi_{ph})} \Big( \bigwedge_{(\pi_i, \pi_j, P) \in C} change_P(\pi_i) \wedge \neg change_P(\pi_j) \Big)$$

Essentially, block encodes whether the set of traces involved cannot proceed
without violating phase, because align forbids all traces involved from moving. Hence,
the formula $phase \,\mathcal{U}\, (missalign \vee block)$ captures those traces of $K^{st}$ that
contain an aligned prefix of computation that leads to a misalignment or a
block. The proof of correctness shows that given a tuple of traces of $K$, if there
is a trajectory that aligns the phase changes (which must exist if there is a
trajectory that makes $\psi_{ph}$ true), then all trajectories that respect $\Box phase$ will
also align the phase changes (and also satisfy $\psi_{ph}$).

We are finally ready to describe the synchronous phase formula $\psi_{sync}$. First,
this formula is only evaluated against tuples of fair traces, which correspond to
the stuttering extensions of paths of $K$. Then, the phase formula $\psi_{ph}$ is translated
into a formula that captures (1) that following a phase alignment cannot lead
to a block or to two traces changing phases a different number of times, and (2)
that if phases are aligned then $\psi_{ph}$ holds. Formally,

$$\psi_{sync} \;\stackrel{def}{=}\; fair \rightarrow \psi\big[\psi_{ph} \mapsto \big(\neg(phase \,\mathcal{U}\, (missalign \vee block)) \;\wedge\; (\Box phase \rightarrow \psi_{ph})\big)\big]$$


Example 1. We illustrate the previous definitions with the Kripke structures
$K_1$, $K_2$ and $K_3$ in Fig. 5 and their stuttering variants $K_1^{st}$, $K_2^{st}$ and $K_3^{st}$. Consider
formula $\forall\pi_1.\forall\pi_2.E.\Box(a_{\pi_1} \leftrightarrow a_{\pi_2})$. Consider the following trace assignments:

$\Pi^1(\pi_1) \mapsto \{\}\,\{st\}\,\{a\} \ldots$   |   $\Pi^2(\pi_1) \mapsto \{\}\,\{a\}\,\{a\} \ldots$   |   $\Pi^3(\pi_1) \mapsto \{\}\,\{\}\,\{\} \ldots$
$\Pi^1(\pi_2) \mapsto \{\}\,\{\}\,\{a\} \ldots$    |   $\Pi^2(\pi_2) \mapsto \{\}\,\{\}\,\{a\} \ldots$    |   $\Pi^3(\pi_2) \mapsto \{a\}\,\{\}\,\{a\} \ldots$

Consider the trace assignment $\Pi^1$ on the left, where $\pi_1$ is a trace of $K_1^{st}$
corresponding to the path of $K_1$ that visits $s_1$, and $\pi_2$ corresponds to the
path that visits $s_2$. This trace assignment aligns the atomic phase formula
$(\pi_1, \pi_2, \{a\})$ at all positions. In particular, at position 0, we have $change_{\{a\}}(\pi_1)$,
but $\neg change_{\{a\}}(\pi_2)$, and $\neg move(\pi_1)$ and $move(\pi_2)$, as $align_{\{a\}}$ requires.

Fig. 5. Kripke structure $K_1$ (left), $K_2$ (middle) and $K_3$ (right).

Consider now the trace assignment $\Pi^2$ in the middle, where again $\pi_1$ corresponds
to the path in $K_1^{st}$ that visits $s_1$ and $\pi_2$ to the path that visits $s_2$. In this case,
we have $\neg align_{\{a\}}$ at position 0 because $change_{\{a\}}(\pi_1)$ and $\neg change_{\{a\}}(\pi_2)$
hold, and both $move(\pi_1)$ and $move(\pi_2)$. Consider now $\Pi^3$ on the right, where $\pi_1$
corresponds to the path of $K_2^{st}$ that visits $s_3$ and $\pi_2$ to the path of $K_2^{st}$ that
visits $s_4$. In this case $align_{\{a\}}$ holds at 0 and missalign holds at 1 because at 1,
$\Box\neg change_{\{a\}}(\pi_1)$ holds, but not $\Box\neg change_{\{a\}}(\pi_2)$. Therefore,
$phase \,\mathcal{U}\, (missalign \vee block)$ holds for $\Pi^3$. Finally, consider
$\forall\pi_1.\forall\pi_2.\forall\pi_3.E.\Box(a_{\pi_1} \leftrightarrow a_{\pi_2} \wedge b_{\pi_2} \leftrightarrow b_{\pi_3} \wedge c_{\pi_3} \leftrightarrow c_{\pi_1})$ and the
following trace assignment $\Pi$ of $K_3^{st}$:

$\Pi(\pi_1) \mapsto \{\}\,\{\}\,\{a\}\,\{c\} \ldots$
$\Pi(\pi_2) \mapsto \{\}\,\{\}\,\{b\}\,\{a\} \ldots$
$\Pi(\pi_3) \mapsto \{\}\,\{\}\,\{c\}\,\{b\} \ldots$

In this case phase holds at position 0 and block holds at position 1. This is because
$change_{\{a\}}(\pi_1)$ and $\neg change_{\{a\}}(\pi_2)$, $change_{\{b\}}(\pi_2)$ and $\neg change_{\{b\}}(\pi_3)$, and
$change_{\{c\}}(\pi_3)$ and also $\neg change_{\{c\}}(\pi_1)$. This illustrates that it will not be
possible to align all three atomic phase formulas.
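The first two trace assignments of the example can be replayed on finite prefixes. The sketch below is a hedged illustration, not the synchronous HyperLTL encoding itself: it evaluates move, change and align semantically on lists of letters, and it assumes that align is read as three implications (both traces move implies they change phase together; if only one trace moves, the moving trace must not change phase). All function names are illustrative.

```python
def next_real(trace, k):
    """Index of the first non-stuttering letter after position k, or None."""
    for j in range(k + 1, len(trace)):
        if "st" not in trace[j]:
            return j
    return None


def move(trace, k):
    """The letter at k+1 is a real (non-stuttering) step."""
    return "st" not in trace[k + 1]


def change(trace, k, preds):
    """The next non-stuttering step alters the valuation of preds."""
    j = next_real(trace, k)
    return j is not None and (trace[k] & preds) != (trace[j] & preds)


def align(t1, t2, k, preds):
    m1, m2 = move(t1, k), move(t2, k)
    c1, c2 = change(t1, k, preds), change(t2, k, preds)
    if m1 and m2:
        return c1 == c2      # both move: change phase together or not at all
    if m1:
        return not c1        # only t1 moves: it must not change phase
    if m2:
        return not c2        # only t2 moves: it must not change phase
    return True              # nobody moves: nothing to check


P = frozenset({"a"})
e, st, a = frozenset(), frozenset({"st"}), frozenset({"a"})
left = ([e, st, a], [e, e, a])      # prefixes of the assignment on the left
middle = ([e, a, a], [e, e, a])     # prefixes of the assignment in the middle
left_ok = align(left[0], left[1], 0, P)
middle_ok = align(middle[0], middle[1], 0, P)
```

On these prefixes the left assignment is aligned at position 0 and the middle one is not, which matches the discussion in the example.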

We are now ready to state the main result of this section. 


Theorem 1. Let $K$ be a Kripke structure and $\psi$ an admissible formula. Then,
$K \models \forall\pi_1 \ldots \forall\pi_n.E.\psi$ if and only if $K^{st} \models \forall\pi_1 \ldots \forall\pi_n.\psi_{sync}$.


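Dually, to show that the $\exists^*$ fragment is decidable, we consider replacing $\psi_{ph}$ by
the formula

$$\psi_{\exists sync} \;\stackrel{def}{=}\; fair \wedge \psi[\psi_{ph} \mapsto (\Box phase \wedge \psi_{ph})]$$

Theorem 2. Let $K$ be a Kripke structure and $\psi$ an admissible formula. Then
$K \models \exists\pi_1 \ldots \exists\pi_n.E.\psi$ if and only if $K^{st} \models \exists\pi_1 \ldots \exists\pi_n.\psi_{\exists sync}$.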


The proof of Theorem 2 takes a witness tuple and trajectory in $K$ and shows
that the induced tuple in $K^{st}$ is fair, satisfies $\Box phase$ and that the valuation of
$\psi_{ph}$ is preserved. Similarly, as before, tuples of traces of $K^{st}$ that are fair and
follow phase alignments induce a trajectory on their stuttering compression that
also preserves $\psi_{ph}$.


Corollary 1. The problems of model-checking $\forall^*$ admissible A-HLTL formulas
and $\exists^*$ admissible A-HLTL formulas are decidable.


We finally consider the negation of phase formulas, called co-phase formulas,
which are formulas of the form $\Diamond\neg R$ where $R$ is a conjunction of atomic phase
formulas. Interestingly, deciding co-admissible formulas (consisting of Boolean
combinations of state formulas, monadic temporal formulas and one co-phase
formula in positive polarity) is easier than before, as one can turn the co-phase
formula into a monadic formula by enumerating all the violations of the atomic
phase formulas: a violation for some $p \in P$ (that is, $p_{\pi_i} \neq p_{\pi_j}$) turns the atomic
phase formula into $(\Diamond p_{\pi_i} \wedge \Diamond\neg p_{\pi_j}) \vee (\Diamond\neg p_{\pi_i} \wedge \Diamond p_{\pi_j})$. It follows that
model-checking co-admissible formulas is also decidable (for both $\forall^*$ and $\exists^*$).
Note that an admissible formula in negative polarity is a co-admissible formula in
positive polarity (and vice versa). Finally, since $K \not\models \forall\pi_1 \ldots \forall\pi_n.A.\psi$ if and
only if $K \models \exists\pi_1 \ldots \exists\pi_n.E.\neg\psi$, it follows that model-checking is also decidable
for the A modality for both admissible and co-admissible formulas (in both
polarities), and for both the $\forall^*$ and $\exists^*$ fragments.


Theorem 3. Model-checking $\forall^*$ or $\exists^*$ admissible and co-admissible formulas is
decidable both for formulas with E and formulas with A.


4.2 The Accelerating Construction 


The admissible formula in the stuttering construction can express many formulas
of interest, but the quantifier structure admits no quantifier alternation. We now
consider a second decidable fragment for A-HLTL formulas consisting of formulas
with arbitrary quantification $Q_1\pi_1.Q_2\pi_2 \ldots Q_n\pi_n.E.\psi$ such that $Q_i \in \{\forall, \exists\}$, but
where $\psi$ is an admissible formula where all atomic phase formulas use the same
atomic predicates $P \subseteq AP$. We call these admissible formulas simple admissible
formulas. The proof of decidability proceeds this time by creating the accelerated
Kripke structure $K^{acc}$, where paths jump in one step to the next phase change,
and reducing to a HyperLTL model-checking problem on $K^{acc}$.


Accelerated Kripke Structure. The main idea of the acceleration construction is
to convert a finite sequence of transitions in $K$ that only change phase in the last
transition into a single transition in $K^{acc}$. Also, an infinite sequence of transitions
with no phase change is transformed into a self-loop around a sink state. The
alphabet remains the same, $AP$. Given $K = (S, S_{init}, \delta, L)$, the accelerated Kripke
structure is $K^{acc} = (S^{acc}, S_{init}, \delta^{acc}, L^{acc})$ where:


— $S^{acc} = S \cup \{s_\perp \mid s \in S\}$ contains two copies of each state in $S$, where we use
$s_\perp$ to denote the sink state associated with $s$. We use $color(s)$ for the phase
of $s$, that is, the concrete valuation in $s$ of the Boolean predicates in $P$ of the
atomic phase formula.

— For every pair of states $s, s' \in S$ such that $color(s) \neq color(s')$, if there is a finite
path $s s_2 s_3 \ldots s_n s'$ in $K$ such that $color(s) = color(s_2) = \cdots = color(s_n)$, then
we add a transition $(s, s')$ to $\delta^{acc}$. These transitions model the jump at the
frontier of phase changes. Additionally, if $s$ can be a sink we add a transition
$(s, s_\perp)$ and a self-loop from $s_\perp$ to itself.

— $L^{acc}(s) = L(s)$ for $s \in S$, and $L^{acc}(s_\perp) = L(s)$.
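The acceleration can be sketched on an explicit-state structure as follows. This is a minimal Python illustration under stated assumptions: `color` is a dictionary giving the valuation of the phase predicates in each state, "s can be a sink" is read as "some infinite path from s never leaves the color of s", and the `_sink` suffix is a made-up name for the sink copy.

```python
def accelerate(states, init, delta, label, color):
    """Build the accelerated structure from an explicit Kripke structure."""
    succ = {s: set() for s in states}
    for s, t in delta:
        succ[s].add(t)
    sink = {s: s + "_sink" for s in states}    # hypothetical name for the sink copy

    delta_acc = set()
    for s in states:
        # States reachable from s while staying in the color of s.
        region, todo = {s}, [s]
        while todo:
            u = todo.pop()
            for v in succ[u]:
                if color[v] == color[s] and v not in region:
                    region.add(v)
                    todo.append(v)
        # Jump to every differently colored state that ends such a path.
        for u in region:
            for v in succ[u]:
                if color[v] != color[s]:
                    delta_acc.add((s, v))
        # s can be a sink if an infinite path stays inside the region:
        # prune states without a successor in the region until stable.
        alive = set(region)
        while True:
            dead = {u for u in alive if not (succ[u] & alive)}
            if not dead:
                break
            alive -= dead
        if alive:
            delta_acc.add((s, sink[s]))
            delta_acc.add((sink[s], sink[s]))

    states_acc = set(states) | set(sink.values())
    label_acc = dict(label)
    label_acc.update({sink[s]: label[s] for s in states})
    return states_acc, set(init), delta_acc, label_acc


# Made-up example: two states without a, then a state with a that loops forever.
S = {"p0", "p1", "q"}
D = {("p0", "p1"), ("p1", "q"), ("q", "q")}
L = {"p0": frozenset(), "p1": frozenset(), "q": frozenset({"a"})}
color = {s: L[s] & {"a"} for s in S}
S_acc, I_acc, D_acc, L_acc = accelerate(S, {"p0"}, D, L, color)
```

In the example, `p0` and `p1` both jump directly to `q`, and only `q` receives a sink copy because it is the only state that can stay in its color forever.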


This construction can, with standard techniques, be enriched to encode the
satisfaction of the temporal monadic formulas along paths of $K$, and then also
accelerate the fairness conditions (annotating the accepting states reached along
the accelerated paths) into $K^{acc}$.


Relating Paths to Accelerated Paths. We now define two auxiliary functions to
aid in the proof.

— The first function, acc, maps paths in $K$ into paths in $K^{acc}$. Let $s$ be an
arbitrary state of $K$ and $\rho : s s_1 s_2 s_3 \ldots$ an outgoing path from $s$. Either there
are infinitely many phase changes in $\rho$ or only finitely many changes. We
create the path $\rho' = acc(\rho)$ as follows. The initial state of $\rho$, that is, $s$, is
preserved. The states $s_i$ in $\rho$ that are color changes (that is, $color(s_{i-1}) \neq color(s_i)$)
are also preserved, while the states $s_i$ with $color(s_{i-1}) = color(s_i)$
are removed from $\rho$. If there are only finitely many color changes in $\rho$, with $r$
being the last state preserved, then we pad the path with $r_\perp^\omega$, so $\rho'$ is also an
infinite path. It is easy to see that $\rho'$ is a path of $K^{acc}$ outgoing $s$. It is also
easy to see that the phase changes in $\rho$ and $\rho'$ are the same.

— The second map, dec, takes a path $\rho' : s s'_1 s'_2 \ldots$ of $K^{acc}$ and maps it to a path
of $K$ as follows. For every transition $(s'_i, s'_{i+1})$ in $\rho'$ such that $s'_{i+1}$ is not of the
form $r_\perp$, there is a finite path $r_1 r_2 \ldots r_m$ in $K$ from $s'_i$ into $s'_{i+1}$ that visits
only states with the same color as $s'_i$, except $s'_{i+1}$, which is a color change. In
$\rho'$, we insert $r_1 r_2 \ldots r_m$ between $s'_i$ and $s'_{i+1}$. Now, if for some $j$, $s'_j$ is of the
form $r_\perp$ then $s'_k = r_\perp$ for all $k > j$. In $K$ there must be an infinite path from $r$
that only visits states of the same color as $r$. We remove all successor states after the
first such $r_\perp$ state and replace it with one such infinite path.
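On a finite path prefix, the core of acc is a single pass that keeps the first state and every state at which the color changes. The sketch below is only an illustration on finite lists: it omits the padding with the sink state, which matters only for infinite paths with finitely many color changes, and the names are hypothetical.

```python
def acc_prefix(path, color):
    """Keep the first state and every state whose color differs from its predecessor."""
    kept = [path[0]]
    for prev, cur in zip(path, path[1:]):
        if color[prev] != color[cur]:
            kept.append(cur)
    return kept


# Made-up colors: p0 and p1 share a color, q and r each have their own.
color = {"p0": 0, "p1": 0, "q": 1, "r": 2}
accelerated = acc_prefix(["p0", "p1", "p1", "q", "q", "r"], color)
```

The resulting list visits the same sequence of colors as the input, which is the sense in which the phase changes of a path and of its acceleration coincide.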


Given a trace assignment $\Pi$ for formula $Q_1\pi_1 \ldots Q_n\pi_n.E.\psi$ that assigns
$\Pi(\pi_i) = (\sigma_i, 0)$ for every $i$ and a path assignment $\Pi'$ for formula
$Q_1\pi_1 \ldots Q_n\pi_n.\psi$ that assigns $\Pi'(\pi_i) = (\sigma'_i, 0)$, we write $acc(\Pi) = \Pi'$ if the
paths that generate the corresponding traces are related by acc. Similarly we
define $dec(\Pi') = \Pi$. It is easy to show from the construction above that if
$\Pi \models E.\psi$ then $acc(\Pi) \models \psi$, and if $\Pi' \models \psi$ then $dec(\Pi') \models E.\psi$.

The main result for the accelerating construction follows immediately from
this observation and allows us to reduce the model-checking problem to HyperLTL.


Theorem 4. Let $K$ be an arbitrary Kripke structure and $Q_1\pi_1 \ldots Q_n\pi_n.E.\psi$ a formula such
that $\psi$ is a simple admissible formula. Then $K \models Q_1\pi_1 \ldots Q_n\pi_n.E.\psi$ if and only
if $K^{acc} \models Q_1\pi_1 \ldots Q_n\pi_n.\psi$.


4.3 Decidable Practical A-HLTL Formulas 


We revisit the properties expressed in Sect. 3.2. 


— Linearizability. The property $\varphi_{LNZ}$ is of the form $\forall\pi.\exists\pi'.E.\Box(history_\pi \leftrightarrow history_{\pi'})$
where the temporal formula is a simple admissible formula. Therefore
$\varphi_{LNZ}$ is decidable by the accelerating construction.

— Goguen and Meseguer’s non-interference. The property $\varphi_{GMNI}$ is expressed
by $\forall\pi.\exists\pi'.E.(\Box\lambda_{\pi'}) \wedge \Box(low_\pi \leftrightarrow low_{\pi'})$, that is, a Boolean combination of a
monadic temporal formula and a simple admissible formula. Therefore, $\varphi_{GMNI}$
is decidable by the acceleration algorithm.

— Not never terminates. Formula Ynyr is simply a Boolean combination of 
state formulas and monadic temporal formulas: Yr.3r'.3n”.E.(m[0] = 2’ [0] = 
T”[0]) — (term, A Onaterm,), so it is again decidable by the accelera- 
tion construction. 
— Termination-insensitive noninterference. To handle $\varphi_{TIN}$ we rewrite the
formula as follows


e aterm, V Linterm,,) V 
prn = Yrvn'.E. (1y => In) = ( ) 


(i A term,) © (lw A term,)) 


Note that $(l_\pi \wedge term_\pi)$ can be turned into a state predicate of $\pi$. This formula
is equivalent because the last case evaluates precisely to $l_\pi \leftrightarrow l_{\pi'}$ when both
traces terminate. This formula can be handled by the stuttering construction.

— Termination-sensitive noninterference. Similarly, to handle $\varphi_{TSN}$ we rewrite
the formula as


e aterm, A Linterm,,) V 
prsu = va Wn! E. (In o Im) E f ) 


(In A termz) > (lx A termw)) 


This is again equivalent because the last case again is the only relevant case 
when both paths terminate. Again, this case is covered by the stuttering 
construction. 


5 Undecidability and Lower-Bound Complexity 


In this section, we show that the general problem of model-checking A-HLTL 
is undecidable. Then, we show a polynomial reduction from the synchronous 
HyperLTL model-checking into A-HLTL model-checking, which shows that even 
for those A-HLTL formulas for which the model-checking is decidable, this prob- 
lem is no easier than the corresponding problem for HyperLTL, which is known 
to be PSPACE-hard in the size of the Kripke structure. 


Theorem 5. Let $K$ be a Kripke structure and $\varphi$ be an asynchronous HyperLTL
formula. The problem of determining whether or not $K \models \varphi$ is undecidable.


Proof (sketch). We reduce the complement of the post correspondence problem 
(PCP) [23,26] to the A-HLTL model checking problem. PCP consists of a set of 
dominos, for example, of the form [2] = {[4], [4], [“¢],[“¢]} and the problem 
is to decide whether there is a sequence of dominos (with possible repetitions), 
such that the upper and lower finite strings of the dominos are equal. A solution 
to the above set of dominos is the sequence [-4][4][£4]{-4][“¢]. We map a given 
set of dominos to a Kripke structure that allows arranging the dominos in a 
sequence (see Fig.6 for an example), where v and w indicate lower and upper 
words, respectively, dom’ is for each domino [*], and proposition lc marks 


whether or not a new letter is processed. The A-HLTL formula in our reduction 


is the following, where $dom_\pi \stackrel{def}{=} \bigvee_{i \in [1..k]} dom^i_\pi$:

$$\varphi_{pcp} \;\stackrel{def}{=}\; \forall\pi_w.\forall\pi_v.E.\big(\varphi_{type} \rightarrow (\varphi_{domino} \vee \varphi_{word})\big)$$

[Figure 6: Kripke structure built from the dominos. The legend distinguishes the initial state, $v$-trace states, $w$-trace states, and the end state.]

Fig. 6. Mapping from PCP to model checking A-HLTL (only the construction for two of the dominos is shown).

where Ptype = (lwr, A Ur, ) U endr, ) A (Owr, A Ur, ) U endr, ) 
e k y . 
domino de (domr „ > domr,) AO VV dom’, # dom, 
i=1 
Pword = (len,  len,) AO V (lru ln) 
LEX pep 


The intention of formula $\varphi_{pcp}$ is that the Kripke structure is a model of the
formula if and only if the original PCP problem has no solution. Intuitively,
formula $\varphi_{type}$ forces trace $\pi_w$ (respectively, $\pi_v$) to traverse only the traces labeled
by $w$ (respectively, $v$) to build a $w$-word (respectively, $v$-word). Formula $\varphi_{domino}$
establishes that the trajectory aligns the positions at which the domino indices
are checked and that at least once the index is different. Finally, formula $\varphi_{word}$ captures
that if $\pi_w$ and $\pi_v$ are aligned to compare the letters, at least one pair of the letters
prescribed by the existential trajectory is different. In the detailed proof in [4],
we show that the constructed Kripke structure satisfies formula $\varphi_{pcp}$ if and only
if the answer to deciding PCP is negative.
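For readers unfamiliar with PCP, the following Python sketch shows the problem being reduced from. It is a bounded brute-force search, not part of the reduction: it can confirm that a solution exists up to a given length, but no bound suffices in general, which is the source of undecidability. The domino set is a standard textbook instance chosen for illustration and is not the one used in the proof sketch.

```python
from itertools import product


def pcp_bounded(dominos, max_len):
    """Return the first index sequence of length <= max_len whose upper and
    lower words coincide, or None if there is none within the bound."""
    for n in range(1, max_len + 1):
        for seq in product(range(len(dominos)), repeat=n):
            upper = "".join(dominos[i][0] for i in seq)
            lower = "".join(dominos[i][1] for i in seq)
            if upper == lower:
                return list(seq)
    return None


# Each domino is a pair (upper word, lower word); indices are 0-based.
dominos = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]
solution = pcp_bounded(dominos, 4)
```

For this instance the search finds the sequence of dominos 2, 1, 2, 0, whose upper and lower words both read `bbaabbbaa`.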


Theorem 5 above implies that there is no algorithm to decide the model-checking
problem correctly for every formula and every system. However, as we
saw in Sect. 4, for some formulas the model-checking problem is decidable. We
now show that in these cases the problem is at least as hard as model-checking
HyperLTL, which is known to be PSPACE-hard [7, 24].


Theorem 6. Given a HyperLTL formula $\varphi$ and a Kripke structure $K$ there is an
A-HLTL formula $\varphi'$ and a Kripke structure $K'$ such that $K'$ is linear in the size
of $K$, $\varphi'$ is polynomial in the size of $\varphi$ and $K \models \varphi$ if and only if $K' \models \varphi'$.


The proof proceeds as follows. Given $K$, we build a Kripke structure $K'$ that
alternates between real states in $K$ and synchronization states. Then the formula
is transformed to force alternations at every other step, therefore forcing the
trajectory to synchronize (see [4] for details). Since the model-checking problem
for HyperLTL is PSPACE-hard in the size of the Kripke structure, the same
follows for A-HLTL.


Corollary 2. For asynchronous HyperLTL formulas, the model checking prob- 
lem is PSPACE-hard in the size of the system. 


6 Case Studies and Evaluation 


We applied our algorithm for the VžE A-HLTL fragment to several examples. 
After manually reducing the asynchronous model checking problem to a syn- 
chronous one, we use MCHYPER [10,11] to check our property. MCHYPER is a 
model checker for synchronous HyperLTL that can handle formulas with up to 
one quantifier alternation. It computes the self composition of the system and 
composes it with the formula automaton. ABC [6] is then used as the backend 
tool checking the reachability of a violation. 
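The self-composition step can be illustrated on an explicit-state Kripke structure. The sketch below is an assumption-laden illustration and not how MCHYPER works internally (the tool operates on circuits): it builds the n-fold synchronous product, in which every copy takes one step at a time, and tags each atomic proposition with the index of the copy it belongs to. All names are hypothetical.

```python
from itertools import product


def self_composition(states, init, delta, label, n):
    """n-fold synchronous product of an explicit Kripke structure with itself."""
    succ = {s: set() for s in states}
    for s, t in delta:
        succ[s].add(t)
    comp_states = set(product(sorted(states), repeat=n))
    comp_init = set(product(sorted(init), repeat=n))
    # Every component moves along one of its own transitions in each step.
    comp_delta = {
        (src, dst)
        for src in comp_states
        for dst in product(*(sorted(succ[s]) for s in src))
    }
    # Proposition p of copy i becomes the pair (p, i).
    comp_label = {
        src: frozenset((p, i) for i, s in enumerate(src) for p in label[s])
        for src in comp_states
    }
    return comp_states, comp_init, comp_delta, comp_label


S = {"s0", "s1"}
D = {("s0", "s1"), ("s1", "s1")}
L = {"s0": frozenset(), "s1": frozenset({"a"})}
C_states, C_init, C_delta, C_label = self_composition(S, {"s0"}, D, L, 2)
```

A formula over two trace variables can then be read as an ordinary trace property over the indexed propositions of the product.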

Our reduction from the asynchronous to the synchronous semantics follows
the stuttering construction described in Sect. 4.1. To model check a system
against an A-HLTL formula, we first add a stuttering input to the system that
forces the system to stutter in the current state. The transformed formula ensures
that the stuttering guarantees synchronous phase changes. In future work, we
will fully automate our reduction, resulting in a verification tool for asynchronous
hyperproperties from the decidable fragment. We now describe the various case
studies¹. All our experiments were performed on a MacBook Pro with a 3.3 GHz
processor and 16 GB of RAM running MacOS 11.1.


6.1 Compiler Optimizations 


We modeled the source and target programs of different compiler optimization 
techniques (from [20]) as finite state machines encoded as circuits, and used asyn- 
chronous hyperproperties to prove the correspondence between both programs. 
We analyzed the following optimizations: 


— Common Branch Factorization (CBF), where expressions occurring in both 
branches of a conditional are factored out; 

— Loop Peeling (LP), which consists in unrolling of a loop that is executed at 
least once; 

— Dead Branch Elimination (DBE), that is, removing conditional checks and 
their branches that are unreachable; and 

— Expression Flatting (EF), which splits complex computations into several 
explicit steps. 


1 The experimental data is publicly available at https://github.com/reactive-systems/MCHyper in case-studies/asynchronous-hyperlt1_2021.

Table 1. Verification times of MCHYPER and system sizes in number of latches (#LS)
and AND-gates (#ANDS) for the case studies.

(a) Compiler Optimizations

Optimizations    #LS   #ANDS   Time (s)
EF                12      64        0.6
DBE               16     128        0.8
CBF               16     145        2.7
LP                 …       …          …
CBF+DBE           16     137       11.4
CBF+DBE+EF        20     175       10.0
CBF+EF            20     180        1.7
EF+LP             41    8642     1315.2

(b) SPI

Property       #LS   #ANDS   Time (s)
SPI-correct     30     175       65.7
SPI-term        33     296      155.8


Besides evaluating each optimization individually, we also examined several 
combinations of these optimizations. Each optimization affects the alignment 
between source and target program, so synchronous hyperproperties fail to rec- 
ognize the correspondence between both programs. Using asynchronous hyper- 
properties instead allows us to compensate for this misalignment by stuttering 
the programs accordingly. Essentially, each optimization is checked against the 
following A-HLTL formula in which m represents traces from the source program 
and 7’ traces from the target program: 


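$$\forall\pi.\forall\pi'.E.\Big(\bigwedge_{i \in I} (i_\pi \leftrightarrow i_{\pi'})\Big) \rightarrow \Big(\Box \bigwedge_{o \in O} (o_\pi \leftrightarrow o_{\pi'})\Big)$$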


This formula states that for all pairs of traces that initially agree on the inputs
from the set $I$ there exists a trajectory that aligns the phase changes of the
outputs in set $O$. We use the stuttering construction and MCHYPER to verify
that in all cases the source and target programs go through the same phases
of possibly different length. The results of this case study are summarized in
Table 1(a). We note that A-HLTL model-checking subsumes the approach in [20]
based on the construction of a buffer automaton to reason about the alignment of
executions.
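The notion of "same phases of possibly different length" can be made concrete on finite output sequences. The following Python sketch is an illustration only, with made-up output values: it collapses consecutive repetitions and compares the resulting phase sequences, which is the finite-trace analogue of asking whether some trajectory aligns the two executions.

```python
from itertools import groupby


def phases(outputs):
    """Collapse consecutive equal output valuations into one phase each."""
    return [value for value, _ in groupby(outputs)]


# Hypothetical outputs of a source program and of an optimized target program.
source_out = [0, 0, 0, 1, 1, 2]
target_out = [0, 1, 1, 1, 1, 2, 2]
reordered_out = [0, 2, 1]

same_phases = phases(source_out) == phases(target_out)
different_phases = phases(source_out) == phases(reordered_out)
```

Here `same_phases` holds even though the two sequences differ position by position, which is why a synchronous comparison would reject this pair while a stuttering-tolerant comparison accepts it.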


6.2 SPI Bus Protocol 


The Serial Peripheral Interface (SPI) is a bus protocol that supports a single 
main component’s communication with multiple secondary components. Each 
secondary can be selected individually by the main via the secondary’s own ss 
(“secondary select”) input signal. If a secondary is enabled (that is, if $\neg ss$ holds
as the secondary select is “active low”), it reads the mosi (main out, secondary
in) signal and writes to the miso (main in, secondary out) wire.

We verify the behavior of a single SPI secondary component that receives
an input which it sends to the main component upon request. This behavior
should always be the same, independent of when the secondary is enabled or
how fast the bus protocol’s “serial clock” (sclk) set by the main component ticks
compared to the secondary’s internal clock. The A-HLTL formula we check is
the following (see observational determinism in Sect. 1):


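$$\forall\pi.\forall\pi'.E.\Big(\bigwedge_{i \in \{in, init\}} (i_\pi \leftrightarrow i_{\pi'}) \;\wedge\; \text{SPI input assumptions}\Big) \rightarrow \Box\Big((miso_\pi \wedge \neg sclk_\pi \wedge \neg ss_\pi) \leftrightarrow (miso_{\pi'} \wedge \neg sclk_{\pi'} \wedge \neg ss_{\pi'})\Big)$$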


This formula (called SPI-correct in Table 1(b)) ensures that for all pairs of
traces $\pi$ and $\pi'$ that agree on the initial configuration, on the input, and additional
SPI input assumptions, there is a trajectory that aligns their relevant
behavior. We consider it relevant that both secondaries agree on their miso out- 
put whenever they are enabled and the sclk is low. Checking miso only when 
the sclk is low is sufficient as changes on miso only occur at falling edges of 
the sclk. The SPI input assumptions are required to guarantee the implicit 
assumptions of the protocol, for example, that the sclk behaves as an infinitely 
ticking clock. By introducing additional variables and applying logical transfor- 
mations, we obtain an equivalent formula that syntactically lies in the fragment 
of the stuttering construction. Again, we reduce this model checking problem to 
the synchronous semantics and use MCHYPER to perform the verification. 

In a second experiment, we modified the system to send the value only once
and checked it for termination-insensitive noninterference SPI-term (see Sects. 3.2
and 4.3). In our setup, we use the variable term to flag that the secondary has 
sent the full value. In the premise of the formula, we require that the input value 
is equal on both traces and again assume that the inputs conform to the SPI 
protocol. The conclusion checks if both secondaries have sent the same values 
by using additional variables that are set together with term. The results of this 
case study are summarized in Table 1(b). 


7 Related Work 


The study of specific hyperproperties, such as noninterference, dates back to the 
seminal work by Goguen and Meseguer [14] in the 1980s. The first systematic 
study of hyperproperties is due to Clarkson and Schneider [8]. 

It is well-known that classic specification languages like LTL cannot express 
hyperproperties. There are two principal methods with which the standard logics 
have been extended to express hyperproperties: 


— The first method is the quantification over variables that identify specific
paths or traces. The temporal logics LTL and CTL* have been extended with
quantification over traces and paths, resulting in the temporal logics HyperLTL
and HyperCTL* [7]. There are also extensions of the $\mu$-calculus, most
recently the temporal fixpoint calculus $H_\mu$ [15], which extends the linear-time
$\mu$-calculus [3] with path quantifiers and indexed next operators.

— The second method is the addition of the equal-level predicate E to first-order
and second-order logics, like MPL, MSO, FOL, and S1S, which results in the
logics FOL[E], S1S[E], MPL[E], MSO[E] [9,13].


HyperCTL*, MPL[E], and MSO[E] are branching-time logics; we therefore
focus in the following on the linear-time logics HyperLTL, $H_\mu$, FOL[E], and
S1S[E]. Among these logics, HyperLTL is the only logic for which practical
model-checking algorithms are known [10,11,17]. For HyperLTL, the algorithms
have been implemented in the model checker MCHyper and the bounded model
checker HyperQube. As discussed in this paper, HyperLTL is limited to synchronous
hyperproperties.

FOL[E] can express a limited form of asynchronous hyperproperties. As
shown in [9], FOL[E] is subsumed by HyperLTL with additional quantification
over predicates. Using such predicates as “markers,” one can relate different
positions in different traces. However, only a finite number of such predicates
is available in each formula. S1S[E] is known to be strictly more expressive
than FOL[E] [9], and conjectured to subsume $H_\mu$ [15]. For S1S[E] and $H_\mu$, the
model checking problem is in general undecidable; for $H_\mu$, two fragments, the
k-synchronous and k-context-bounded fragments, have been identified for which model
checking remains decidable [15]. Even though some asynchronous properties can
be expressed in these decidable fragments of $H_\mu$, there is no systematic study to
characterize practical properties that can be encoded. Like S1S[E] and $H_\mu$, asynchronous
HyperLTL has an (in general) undecidable model checking problem.
However, in this paper we have identified decidable fragments of asynchronous
HyperLTL that can express observational determinism, noninterference, and linearizability.
A-HLTL is thus the first logic for hyperproperties that can express
the major asynchronous hyperproperties of interest within decidable fragments.
Furthermore, asynchronous HyperLTL is the first logic for asynchronous hyperproperties
with a practical model checking algorithm.


8 Conclusion 


We have introduced A-HLTL, a temporal logic to describe asynchronous hyper- 
properties. This logic extends HyperLTL with trajectory modalities, which con- 
trol when a trace proceeds and when it stutters. Synchronous HyperLTL corre- 
sponds to a trajectory that always moves all paths in a lock-step manner. This 
notion of trajectory allows us to define formulas that are invariant under stuttering,
paving the way for relevant model-checking optimizations such as partial order
reduction and abstraction-refinement techniques in the context of hyperproperties.
We show that model-checking A-HLTL formulas is in general undecidable,
and identify two fragments of A-HLTL formulas, which cover a rich set of security
requirements and can be decided by a reduction to HyperLTL model-checking.
This in turn has allowed us to reuse the existing model-checker MCHYPER.

Future work includes the study of larger decidable fragments (that encompass
both fragments studied here), extending the logic to allow several trajectory
modalities, as well as their implementation in practical tools. Extending
bounded model-checking [17] to A-HLTL is another interesting research problem.
Asynchronous hyperproperties are important for applying a logic-based
verification approach to verify hyperproperties for software programs, because
the relative speed of the execution of programs depends on many factors, like the
compiler, hardware, execution platform and concurrently running programs, that
the analysis must tolerate. Therefore, future work includes adapting techniques
for infinite-state software model-checking, like deductive methods, abstraction,
etc. to verify A-HLTL properties of software systems.

Abstract. Most existing program verifiers check trace properties such
as functional correctness, but do not support the verification of hyper- 
properties, in particular, information flow security. In principle, prod- 
uct programs allow one to reduce the verification of hyperproperties to 
trace properties and, thus, apply standard verifiers to check them; in 
practice, product constructions are usually defined only for simple pro- 
gramming languages without features like dynamic method binding or 
concurrency and, consequently, cannot be directly applied to verify infor- 
mation flow security in a full-fledged language. However, many existing 
verifiers encode programs from source languages into simple intermediate 
verification languages, which opens up the possibility of constructing a 
product program on the intermediate language level, reusing the exist- 
ing encoding and drastically reducing the effort required to develop new 
verification tools for information flow security. 

In this paper, we explore the potential of this approach along three 
dimensions: (1) Soundness: We show that the combination of an encod- 
ing and a product construction that are individually sound can still be 
unsound, and identify a novel condition on the encoding that ensures 
overall soundness. (2) Concurrency: We show how sequential product 
programs on the intermediate language level can be used to verify infor- 
mation flow security of concurrent source programs. (3) Performance: 
We implement a product construction in Nagini, a Python verifier built 
upon the Viper intermediate language, and evaluate it on a number of 
challenging examples. We show that the resulting tool offers acceptable 
performance, while matching or surpassing existing tools in its combina- 
tion of language feature support and expressiveness. 


1 Introduction 




Since computer programs increasingly handle sensitive user data and commu- 
nicate using encryption, it is vital that programs do not leak secret data such 
as private keys to attackers, that is, that they are information flow secure. One 
way of formalizing information flow security is noninterference, a so-called 2- 
hyperproperty, i.e., a property of pairs of executions of the program. 
© The Author(s) 2021 
Noninterference can be checked by type systems [45] and static analyses [22].
However, complex language features (such as concurrency) and noninterference 
properties (such as termination sensitivity) generally require the expressiveness 
of deductive verification. In recent years, many automated and expressive verifi- 
cation tools have been developed for a wide range of programming languages, but 
most of these tools are limited to trace properties (properties of single program 
traces) and cannot prove hyperproperties such as noninterference. 

The problem we address in this paper is how to retrofit existing program 
verifiers to check noninterference. Compared to building noninterference veri- 
fiers from scratch, which can take years when targeting substantial subsets of 
real-world programming languages, this approach would allow us to reuse most 
aspects of existing verifiers, such as the semantic representation of language fea- 
tures and proof search algorithms. Moreover, it naturally allows one to verify 
combinations of correctness and noninterference properties. 

In principle, existing program verifiers can be used to verify hyperproperties 
by reducing them to trace properties via self-composition [6] or product
programs [4,5]. However, self-composition does not allow modular verification [48],
and product programs have generally been defined only for simple languages 
without features like dynamic method binding or concurrency [4,18]. Applying 
product constructions to programs written in complex languages would therefore 
require defining and implementing new and complex product constructions for 
every new verifier. 

We explore a more efficient approach here: We leverage the fact that most 
automatic deductive verifiers are organized into a custom frontend, which 
encodes a source program into an intermediate verification language (IVL), 
and a reusable backend, which verifies the IVL program using generic proof 
search engines. Boogie [3], Viper [35], and Why3 [21] are examples of such IVLs, 
which power a large number of program verifiers; for instance Boogie is used by 
Dafny [29], VCC [13], Spec# [30], and GPUVerify [8], Why3 [21] by Frama-C [14] 
and Krakatoa [20], and Viper [35] by Vercors [10], Prusti [2], and Nagini [17]. 
The ubiquity of this architecture offers a chance to retrofit existing verifiers to
check noninterference by performing the product construction on the level of the 
IVL (an approach that is already used by SymDiff [27] for the related problem 
of program equivalence). The resulting architecture, which allows one to reuse 
both the frontend and the backend of the existing verifier, is shown in Fig. 1. 

Performing the product construction on the IVL level has three major
advantages over a product construction on the source program: (1) It cleanly separates
the encoding of the source language (which tends to be complex for full-fledged 
languages) from the product construction. (2) The product construction is much 
simpler since IVLs are small, sequential languages. (3) The product construction 
can be reused across all verifiers built on the same IVL. Overall, this architecture 
therefore has the potential to make existing verifiers information flow aware with 
substantially less effort than building a new tool from scratch. 

Even though this approach has strong advantages, there are several open 
questions that must be addressed to make it useful and widely applicable: 
Fig. 1. Proposed architecture for information flow verifiers. The existing encoding from
source to IVL (frontend) as well as the proof search (backend) can be reused. The 
product construction needs to support only the (relatively small) IVL and can be 
reused across different verifiers. 


1. Soundness: Given an IVL encoding and a product construction that are indi- 
vidually sound, is the resulting combination always sound as well? 

2. Concurrency: There is a substantial number of verifiers that verify concurrent 
source programs by encoding them into (sequential) IVLs. Can we soundly 
verify information flow security of concurrent programs based on a product
program of the sequential IVL encoding?

3. Performance: Product constructions cause a performance penalty for verifica- 
tion. Does this overhead prevent the construction of useful verification tools 
in practice? 


In this paper, we answer these three questions. We focus our investigation 
on modular product programs [18], a kind of product program that allows mod- 
ular verification and is well-suited for precise specification and verification of 
information flow security. We make the following contributions: 


— We show that the combination of sound IVL encodings and sound product 
constructions can indeed be unsound in practically-relevant cases. We identify 
a novel condition on IVL encodings that ensures the soundness of the overall 
workflow. We show how to adjust existing unsound encodings on the example 
of a commonly-used encoding for dynamically-bound method calls (Sect. 3). 

— We show for the common case of data race free programs using locks that 
it is possible to verify both possibilistic and probabilistic noninterference for 
concurrent programs using sequential product programs. Furthermore, we 
demonstrate that existing criteria for verifying information flow security are 
insufficient in this setting; we provide alternative criteria that are sound and 
show how to encode them in a product program (Sect. 4). 

— We implement the approach for Nagini [17], an automated, modular verifi- 
cation tool for a large subset of Python, built on top of the Viper IVL [35]. 
We evaluate the performance impact of the product construction and show 
that, while worse than a custom-made information flow verifier, performance 
is acceptable for real-world use (Sect. 5). Our implementation and evaluation
are available as an artifact [16]. 
These results demonstrate that the proposed approach can indeed be used to
retrofit an existing verifier to soundly check information flow security, even for 
concurrent programs. The resulting tool, made with only a fraction of the effort 
required for the development of a new verifier, can compete with custom-made 
tools in its expressiveness at an acceptable performance cost. 


2 Preliminaries 


In this section, we introduce the necessary background about noninterference 
and product programs. 


2.1 Noninterference 


A common way of formalizing information flow security is noninterference [23]. 
Informally, noninterference specifies that the secret (or high) inputs of a pro- 
gram do not influence the values of its public (or low) outputs. We will not 
define a formal semantics here, but just assume that there is a steps-to relation
$\langle s, \sigma \rangle \rightarrow \langle s', \sigma' \rangle$ that relates program configurations consisting of a store $\sigma$ and
a statement $s$.

We formalize noninterference as a property of pairs of program executions 
(that is, a 2-hyperproperty [12]) as follows: 


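Definition 1. A program $s$ with a set of input variables $I$ and output variables
$O$, of which some subsets $I_l \subseteq I$ and $O_l \subseteq O$ are low, satisfies noninterference
iff for all $\sigma_1, \sigma_2$ and $\sigma_1', \sigma_2'$, if $\forall x \in I_l.\ \sigma_1(x) = \sigma_2(x)$ and $\langle s, \sigma_1 \rangle \rightarrow^* \langle \mathit{skip}, \sigma_1' \rangle$
and $\langle s, \sigma_2 \rangle \rightarrow^* \langle \mathit{skip}, \sigma_2' \rangle$ then $\forall x \in O_l.\ \sigma_1'(x) = \sigma_2'(x)$.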


Note that in this definition (and throughout this paper unless stated other- 
wise), we do not consider non-terminating executions, i.e., we focus on verifying 
termination-insensitive noninterference. 
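To make Definition 1 concrete, the following sketch checks termination-insensitive noninterference by brute force for tiny, terminating programs over a finite input domain. The helper `satisfies_noninterference`, the example programs `leaky` and `safe`, and the representation of stores as dictionaries are assumptions made purely for illustration; they are not part of the formalization used in this paper.

```python
from itertools import product

def satisfies_noninterference(program, inputs, low_inputs, low_outputs, domain):
    """Check Definition 1 by enumerating all pairs of initial stores.

    `program` maps an initial store (dict) to a final store (dict) and is
    assumed to terminate on every input.
    """
    stores = [dict(zip(inputs, values))
              for values in product(domain, repeat=len(inputs))]
    for s1, s2 in product(stores, repeat=2):
        # Only pairs of stores that agree on all low inputs are relevant.
        if any(s1[x] != s2[x] for x in low_inputs):
            continue
        f1, f2 = program(s1), program(s2)
        # The final stores must agree on all low outputs.
        if any(f1[x] != f2[x] for x in low_outputs):
            return False
    return True

def leaky(store):
    # The low output depends directly on the high input h.
    return {"out": store["l"] + store["h"]}

def safe(store):
    # The low output depends only on the low input l.
    return {"out": store["l"] * 2}

leaky_ok = satisfies_noninterference(leaky, ["l", "h"], ["l"], ["out"], range(3))
safe_ok = satisfies_noninterference(safe, ["l", "h"], ["l"], ["out"], range(3))
print(leaky_ok, safe_ok)
```

Such an enumeration is only feasible for toy domains; deductive verification replaces it with a proof that covers all pairs of executions.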


2.2 Modular Product Programs 


Proving hyperproperties requires reasoning about multiple (here, two) execu- 
tions of a program. However, hyperproperties can be reduced to properties of a 
single execution by using self-composition [6] or product programs [4]. The idea 
is to duplicate a program’s state space by creating two renamed copies of all 
variables, one for each execution (we write $x^{(i)}$ for the $i$th renaming of variable
x, and lift this notation to expressions), and to transform each statement so that 
it has the effect of the original statement on both copies of the state. Unlike self- 
composition, which achieves this effect by simply duplicating every statement, 
modular product programs [18] do not duplicate loops and method calls, and 
instead encode differing control flow through activation variables, which repre- 
sent, for each execution, whether or not it is active (i.e., it executes the code) at 
the current point in the program. This approach results in a structural alignment 
def bar(z):              1  def bar(p1, p2, z1, z2):
    ...                  2      ...
def foo(x):              3  def foo(p1, p2, x1, x2):
    if x > 0:            4      p1t = p1 and x1 > 0
        y = 1            5      p2t = p2 and x2 > 0
    else:                6      p1e = p1 and not (x1 > 0)
        y = 2            7      p2e = p2 and not (x2 > 0)
    bar(y > 0)           8      if p1t: y1 = 1
                         9      if p2t: y2 = 1
                        10      if p1e: y1 = 2
                        11      if p2e: y2 = 2
                        12      if p1 or p2:
                        13          bar(p1, p2, y1 > 0, y2 > 0)


Fig. 2. A modular product program (on the right) of the program on the left. 


of both program executions, which allows one to use method specifications and 
loop invariants that relate both executions, as we discuss below. We denote the 
product of statement s under activation variables p1 and p2 as $\llbracket s \rrbracket^{\mathring{p}}$.

Figure 2 shows an example program and the respective product program. For 
both functions, the product program duplicates the parameters of the original 
function and adds boolean activation variables p1 and p2. Control structures
like conditionals are encoded by creating a set of new activation variables (lines
4-7). For example, p1t represents whether the first execution is active in the
then-branch of the conditional, which is the case if it was active at the beginning of
the function and the if-condition is true for the first execution. Conversely, p2e 
represents whether the second execution is active in the else-branch. Primitive 
statements like assignments are then executed under the condition that their 
execution is active at the current point in the execution (lines 8-11). Crucially, 
the method call to bar is not duplicated; it is executed if at least one execution 
is active at the call site, and the values of the current activation variables are 
passed to the function, meaning that if an execution is inactive at the call site, 
no state changes will be performed for that execution in the called method. 

Because a single method call in the product represents the calls from both 
executions, one can reason about method calls modularly in terms of relational 
specifications, i.e., specifications that relate behavior of two executions of the 
method, as opposed to unary specifications that describe only a single execution. 
Relational specifications are encoded as ordinary specifications in the product 
program that relate parameters from the two different executions. 

As an example, assume that bar prints the value of its input z, which must 
therefore be low. We can express this as a (relational) precondition low(z), which 
can be encoded as the precondition $p1 \wedge p2 \Rightarrow z1 = z2$ in the product of bar.
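The product program of Fig. 2 can be run directly as ordinary Python, which gives a feel for how the relational precondition becomes a plain assertion. The sketch below is illustrative only: the body of bar is elided in the figure, so here it merely checks the encoded precondition, and the initial value 0 for y1 and y2 is an arbitrary choice standing in for the value of y in an inactive execution.

```python
def bar(p1, p2, z1, z2):
    # Relational precondition low(z), encoded as: p1 and p2 ==> z1 == z2.
    assert (not (p1 and p2)) or z1 == z2, "low(z) violated"
    # The actual body of bar (printing z) is omitted in this sketch.

def foo(p1, p2, x1, x2):
    # Activation variables for the then- and else-branches.
    p1t = p1 and x1 > 0
    p2t = p2 and x2 > 0
    p1e = p1 and not (x1 > 0)
    p2e = p2 and not (x2 > 0)
    y1 = y2 = 0  # arbitrary placeholder for an inactive execution
    # Each assignment only takes effect for an active execution.
    if p1t: y1 = 1
    if p2t: y2 = 1
    if p1e: y1 = 2
    if p2e: y2 = 2
    # The call is not duplicated: it happens if either execution is active.
    if p1 or p2:
        bar(p1, p2, y1 > 0, y2 > 0)

# Run the product for all pairs of inputs in a small range, with both
# executions active; x1 and x2 are unconstrained, i.e. x is treated as high.
violations = 0
for x1 in range(-3, 4):
    for x2 in range(-3, 4):
        try:
            foo(True, True, x1, x2)
        except AssertionError:
            violations += 1
print(violations)
```

No pair of inputs violates the precondition of bar here, because y > 0 evaluates to the same value in both executions regardless of x. A verifier would establish this for all inputs rather than a sampled range.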

Events the attacker can observe (such as I/O) must not happen depending on 
a secret, to avoid leaking secret data. It is, thus, useful to express in specifications 
that the control flow at the current program point is low, i.e., whether the current 
statement is executed does not depend on secret data. This property is denoted in 
specifications as lowEvent. We generally write $\llbracket P \rrbracket^{\mathring{p}}$ for the encoding of assertion
P under activation variables p1 and p2; $\llbracket lowEvent \rrbracket^{\mathring{p}}$ is then defined as p1 = p2.
A unary (that is, non-relational) predicate Q, such as a standard method
pre- or postcondition, is encoded in the product program as applying to each
active execution, i.e., $\llbracket Q \rrbracket^{\mathring{p}}$ is defined as $(p1 \Rightarrow Q^{(1)}) \wedge (p2 \Rightarrow Q^{(2)})$.
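The three assertion encodings described so far (low(e), lowEvent, and unary predicates) can be summarized as small Boolean functions. This is a minimal sketch with hypothetical helper names (`enc_low`, `enc_low_event`, `enc_unary`); the arguments stand for the values that the activation variables and the renamed expressions take in the two executions.

```python
def enc_low(p1, p2, e1, e2):
    # low(e): if both executions are active, e has the same value in both.
    return (not (p1 and p2)) or e1 == e2

def enc_low_event(p1, p2):
    # lowEvent: both executions agree on whether this point is reached.
    return p1 == p2

def enc_unary(p1, p2, q1, q2):
    # Unary Q: Q holds in every execution that is active.
    return ((not p1) or q1) and ((not p2) or q2)

# low(e) is vacuously true if one execution is inactive.
print(enc_low(True, False, 1, 2), enc_low(True, True, 1, 2))
# lowEvent fails if only one execution reaches the program point.
print(enc_low_event(True, False))
# A unary predicate is not checked for an inactive execution.
print(enc_unary(False, True, False, True))
```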

Compared to type systems and taint analyses, verification based on product 
programs allows for much more precise reasoning. Assume for example that foo’s 
parameter x is high. Nonetheless, we can show that the example does not leak 
information, since the precondition of bar, low(z), will always be fulfilled (y > 0 
is true independently of the value of x). In contrast, security type systems would 
flag y as high, since it is assigned to under a high guard, leading to imprecision. 

In addition to ordinary noninterference, modular product programs can also 
be used to encode more advanced security properties, including termination- 
sensitive noninterference, value-dependent sensitivity [36], and a form of declas- 
sification [18]. 


3 Sound Products of IVL Encodings 


In this section, we address the first question from the introduction, namely, 
whether we can soundly combine an existing encoding into an IVL with a product 
construction. We first describe the proposed architecture in greater detail. Then 
we show a potential soundness issue and define a sufficient criterion on the IVL 
encoding for the entire approach to be sound. Finally, we discuss an example of 
a common encoding pattern that violates the criterion, show that it is indeed 
unsound, and propose an alternative sound encoding. 


3.1 Proposed Architecture 


The architecture proposed in the introduction (Fig. 1) enables the construction of 
information flow aware verifiers with relatively little effort, by reusing most of the 
frontend encoding of the source language to an IVL as well as the entire backend 
proof search. The only major change that is necessary is that the frontend and 
potentially the IVL have to be extended to allow for the use of information flow 
assertions in specifications. Crucially, the frontend does not have to know their 
meaning; it can treat relational source-level assertions such as low(e) like ordinary
unary predicates and simply translate them to their counterparts on the IVL
level. IVL-level relational assertions will then be translated to ordinary assertions
during the product transformation. 

In the remainder of this paper, we will generally assume that the existing 
IVL encoding is used unchanged, and point out when changes need to be made. 


3.2 Soundness Issue 


Surprisingly, combining a sound encoding from source language to IVL with 
a sound IVL-level product construction may result in a verification technique 
that is unsound in the presence of relational specifications. Consider the source 
program in Fig. 3 (left), where P is some predicate.
def foo(x):                  def foo(x):
    if x > 7:                    assert x > 7 ? P(5) : P(7)
        y = 5
    else:
        y = 7
    assert P(y)


Fig. 3. Example of an encoding that is unsound in our setting. The program on the left 
can be encoded into a conditional statement (identical to the source program, modulo 
language syntax) or to the program on the right; the latter leads to unsoundness if P 
is a relational predicate. 


A frontend could encode the body of foo into an identical (modulo syntax) 
conditional statement on the IVL level (assuming the IVL provides conditionals, 
assignments, and assert statements). Alternatively, it could produce the encoding 
shown in Fig. 3 (right), which directly asserts a sufficient precondition of the
source program. If P is a unary predicate, both encodings are sound: If they 
verify, the original program is correct. However, if P(y) is a relational predicate, 
for instance, low(y), then the encoding on the right is unsound: low(5) and low(7) 
are trivially true (since 5 = 5 and 7 = 7), so the assertion in the encoded program 
trivially passes, yet the original program is clearly incorrect: If x is greater than
7 in one execution but not in the other, y will have different values in the two
executions, and will therefore not be low.

The underlying reason is that the encoding on the right does not encode 
the exact behavior of the source program; it encodes a verification condition 
computed by the frontend that is sound if assertions are unary, but may not be 
sound for relational assertions. 
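The difference between the two encodings of Fig. 3 can be reproduced in a few lines of Python for P = low. This is a simplified model rather than an actual product construction: `vc_style_check` mimics the product of the right-hand encoding, where each constant is compared with itself, and `operational_check` mimics the product of the left-hand encoding, where the conditional is executed in both executions first. Both function names are hypothetical.

```python
def low(v1, v2):
    # Relational predicate: the value is equal in both executions.
    return v1 == v2

def vc_style_check(x1, x2):
    # Right-hand encoding: assert x > 7 ? P(5) : P(7).
    # With P = low, each constant is compared with itself.
    ok1 = low(5, 5) if x1 > 7 else low(7, 7)
    ok2 = low(5, 5) if x2 > 7 else low(7, 7)
    return ok1 and ok2

def operational_check(x1, x2):
    # Left-hand encoding: execute the conditional in both executions,
    # then check low(y) on the two resulting values of y.
    y1 = 5 if x1 > 7 else 7
    y2 = 5 if x2 > 7 else 7
    return low(y1, y2)

pairs = [(x1, x2) for x1 in range(5, 11) for x2 in range(5, 11)]
vc_accepts_all = all(vc_style_check(x1, x2) for x1, x2 in pairs)
operational_accepts_all = all(operational_check(x1, x2) for x1, x2 in pairs)
print(vc_accepts_all, operational_accepts_all)
```

The verification-condition-style check accepts every pair of inputs, whereas the operational check rejects pairs such as (8, 6), where the two executions take different branches.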

We will now (1) formalize this intuition and derive a sufficient condition for 
the soundness of an encoding in this approach, and (2) show an example of this 
problem occurring in real frontends, and describe how it can be solved. 


3.3 Soundness Criterion 


We write $\Sigma$ and $S$ for states and statements of the source language, and $\sigma$ and $s$
for states and statements of the IVL. States may contain, for example, a mutable
heap and a variable store. For simplicity, we assume that both source and IVL
statements contain a statement skip that represents a finished computation. We
also assume that there is a small-step transition relation $\rightarrow$ for both languages,
and that the standard notion of Hoare triple validity $\models \{P\}\, s\, \{Q\}$ is defined for
the IVL. We let P and Q range over (source and IVL level) assertions from a 
standard assertion language extended with low(e) and lowEvent, and assume a 
standard definition of assertion validity for pairs of states. 

We define an encoding to be a triple $(\alpha, \simeq, \beta)$, where $\alpha : S \rightarrow s$ is an encoding
from source statements to statements of the target language (i.e., the IVL), $\beta$
similarly encodes assertions to the target language, and $\simeq$ relates source language
states to corresponding target language states.


Product Programs in the Wild 725 


We first define the desired relational soundness property, which expresses 
that if an encoded Hoare triple holds for the encoded program, then the original 
property holds for all pairs of executions of the source program: 


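Definition 2. $(\alpha, \simeq, \beta)$ is relationally sound iff, for all $S, \Sigma_1, \Sigma_2, \Sigma_1', \Sigma_2', P, Q$,
if $\models \{\llbracket \beta(P) \rrbracket^{\mathring{p}}\}\ \llbracket \alpha(S) \rrbracket^{\mathring{p}}\ \{\llbracket \beta(Q) \rrbracket^{\mathring{p}}\}$ and $\Sigma_1, \Sigma_2 \models P$ and $\langle S, \Sigma_1 \rangle \rightarrow^* \langle \mathit{skip}, \Sigma_1' \rangle$
and $\langle S, \Sigma_2 \rangle \rightarrow^* \langle \mathit{skip}, \Sigma_2' \rangle$, then $\Sigma_1', \Sigma_2' \models Q$.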


Product programs represent the operational behavior of two program execu- 
tions by the operational behavior of a single product program execution. The 
unsoundness shown before is caused by the fact that the encoding into the IVL 
does not reflect the operational behavior of the conditional statement (replacing 
it by an assertion of a sufficient precondition) and, thus, the resulting product 
does not soundly reflect two executions of the source program. 

We call an encoding that preserves the operational behavior of the source 
program operational: It encodes every step of the source program into some 
number of steps of the target program so that matching initial states result in matching
end states. Similarly, it encodes specifications from the source level into target- 
level specifications that hold in matching states. We can formalize this intuition 
by requiring that the source and target programs are connected by the simulation
relation $\simeq$:


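Definition 3. $(\alpha, \simeq, \beta)$ is an operational encoding if: (1) for all $\Sigma, \Sigma', \sigma, S, S'$,
if $\langle S, \Sigma \rangle \rightarrow \langle S', \Sigma' \rangle$ and $\Sigma \simeq \sigma$, then $\langle \alpha(S), \sigma \rangle \rightarrow^* \langle \alpha(S'), \sigma' \rangle$ for some $\sigma'$ s.t.
$\Sigma' \simeq \sigma'$, and (2) if $\Sigma \simeq \sigma$ then $\Sigma \models P$ iff $\sigma \models \beta(P)$.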


Note that this notion allows the encoding to overapproximate the behaviors 
of the source program, i.e., admit steps that are not possible on the source level, 
but not vice versa. 

For the example in Fig. 3, it is easy to see that this criterion is fulfilled by 
the left encoding: the source and IVL programs are identical (modulo syntax), 
matching states are identical states (modulo state encodings), and the behav- 
ior of both programs is identical. The encoding on the right, however, is not 
operational: While the left program modifies the state, the right program never 
performs any state modification. 
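For explicit finite transition systems, condition (1) of Definition 3 can be checked mechanically. The following sketch is a toy illustration under several assumptions: configurations are (statement, state) pairs, states are plain integers, matching states are equal states, and the statement strings and the function `satisfies_condition_1` are hypothetical. Condition (2), which concerns assertions, is not modeled.

```python
def reachable(steps, start):
    """All configurations reachable from `start` in zero or more steps."""
    seen, todo = {start}, [start]
    while todo:
        cfg = todo.pop()
        for nxt in steps.get(cfg, ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def satisfies_condition_1(src_steps, tgt_steps, alpha, related, tgt_states):
    """Every source step is matched by zero or more target steps that end
    in the encoded successor statement and a matching state."""
    for (stmt, state), successors in src_steps.items():
        for stmt2, state2 in successors:
            for t in tgt_states:
                if not related(state, t):
                    continue
                targets = reachable(tgt_steps, (alpha[stmt], t))
                if not any(s == alpha[stmt2] and related(state2, t2)
                           for s, t2 in targets):
                    return False
    return True

# Source language: one statement that adds 2 to x in a single step.
src_steps = {("x := x + 2", x): [("skip", x + 2)] for x in (0, 1)}

# Target language: only increments by 1 are available.
tgt_steps = {}
for x in range(3):
    tgt_steps[("x := x + 1; x := x + 1", x)] = [("x := x + 1", x + 1)]
    tgt_steps[("x := x + 1", x)] = [("skip", x + 1)]

same = lambda a, b: a == b
alpha_good = {"x := x + 2": "x := x + 1; x := x + 1", "skip": "skip"}
alpha_bad = {"x := x + 2": "skip", "skip": "skip"}  # drops the state change

good = satisfies_condition_1(src_steps, tgt_steps, alpha_good, same, range(4))
bad = satisfies_condition_1(src_steps, tgt_steps, alpha_bad, same, range(4))
print(good, bad)
```

The first encoding takes two target steps per source step and passes the check; the second never modifies the state, much like the right-hand encoding of Fig. 3, and fails it.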

We now show that operationality is sufficient for relational soundness: 


Theorem 1. If $(\alpha, \simeq, \beta)$ is operational then it is relationally sound.


Note that operationality is a sufficient but not necessary condition; encodings of 
verification conditions may be sound for relational verification as well. The main 
advantage of applying the operationality criterion instead of directly reasoning 
about relational soundness is that, since operationality represents the simple 
notion that the IVL program performs equivalent steps and equivalent state 
changes to the source program, it is intuitive and easy to check whether a given 
encoding is operational. Additionally, some encodings (like the one Vercors uses 
for parallel blocks) are not operational, but can be seen as simplified versions of 
a possible operational encoding that generate the same proof obligations; these 
can also be quickly identified as relationally sound. 
3.4 Practical Relevance


In most existing frontends, the encoding of virtually all source language con- 
structs is operational; the main appeal of IVLs is, after all, that frontends do 
not have to compute verification conditions, but can instead “compile” input pro- 
grams into an IVL without worrying about the verification process itself. How- 
ever, many frontends still use non-operational encodings at least for some lan- 
guage constructs. Examples for this are VCC’s encoding of local blocks, Dafny’s 
encoding of calls on traits, Prusti’s encoding for loops, and Nagini’s encoding of 
dynamically-bound calls, which we will discuss in detail in the next subsection. 
Additionally, as we will discuss in Sect. 4, all encodings of concurrent source
languages into sequential IVLs necessarily have some non-operational elements.

Where non-operational encodings are used, this is often intentional to enable 
modular verification, since operational encodings for some language constructs 
are inherently non-modular (see the example in the next subsection). In prac- 
tice, one can therefore use the operationality criterion to quickly check that the 
existing encoding is sound for the vast majority of source language statements, 
and subsequently check the few remaining ones for relational soundness in detail. 


3.5 Example: Dynamically-Bound Calls 


In this section, we show a real example of an unsound encoding of dynamically- 
bound calls that violates the operationality criterion, and show how to derive a 
sound alternative. 

Statically-bound method calls, i.e., calls whose exact target is fixed at com- 
pile time, can be encoded as procedure calls on the IVL level, which yields an 
operational encoding if the operational semantics of the IVL treats calls analo- 
gously to the source semantics. The IVL verifier might later reason about calls in 
terms of pre- and postconditions instead of actually performing a call, but this 
transformation is not relevant here as long as the product program is constructed 
before such a desugaring step. 

However, the same approach does not work for dynamically-bound calls, 
i.e., calls whose target is chosen at runtime based on the type of the call’s 
receiver. Since the implementation to be executed is generally not known during 
modular verification, it is not possible to encode dynamically-bound calls as 
procedure calls with the usual operational semantics (and existing IVLs do not 
offer dynamically-bound calls). Therefore, dynamically-bound calls are typically 
(e.g., in Dafny and Nagini) directly encoded using the method specification. 
Additional, separate proof obligations enforce that all overrides of a method 
respect behavioral subtyping [33], i.e., live up to the specification of the overridden 
method. 

Consider method A.foo in Fig. 4 (left), which returns a constant integer and 
guarantees in its postcondition that the result is low. A dynamically-bound call 
a.foo(), where a has the static type A, will be encoded as an assertion of the 
(here, trivial) precondition of A.foo, followed by an assumption of the postcon- 
dition (we ignore side effects here for simplicity). 
class A:                          class B(A):
    def foo(self) -> int:             def foo(self) -> int:
        # ensures low(result)             # ensures low(result)
        return 0                          return 1


Fig. 4. Example of a problematic method override. B.foo overrides A.foo and has a 
compatible specification, but the implementations return different values. 


This encoding is sound if foo has a purely unary specification, without any 
relational parts. However, it does not fulfill our operationality criterion: The 
semantics of the source program performs a call to an implementation of foo 
(selected based on the dynamic type of a), whereas the IVL encoding directly 
encodes the proof obligations (similarly to the example from Fig. 3). 

Since the encoding is not operational, we have to check whether it is still 
relationally sound. Method B.foo in Fig. 4 (right), which overrides A.foo, shows 
that it is not. B.foo’s contract is identical with that of A.foo, so behavioral 
subtyping holds trivially. B.foo’s implementation satisfies the contract because 
it also returns a constant (but, importantly, a different one). Now, if a client 
calls a.foo() and, depending on a secret, the dynamic type of a is either A or B, 
then, depending on the secret, the result will be either 0 or 1. With the standard 
encoding of dynamically-bound calls outlined above, however, the client will 
assume the postcondition of A.foo and will therefore incorrectly conclude that 
the returned result is low. 
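The scenario can be replayed in plain Python. The classes are those of Fig. 4; the functions `client` and `impl_result_is_low` are hypothetical helpers added for illustration, and comparing results of two concrete runs is only a stand-in for reasoning about all pairs of executions.

```python
class A:
    def foo(self) -> int:
        # ensures low(result)
        return 0

class B(A):
    def foo(self) -> int:
        # ensures low(result)
        return 1

def client(secret: bool) -> int:
    # The dynamic type of the receiver depends on the secret.
    a = A() if secret else B()
    return a.foo()

def impl_result_is_low(cls) -> bool:
    # Each implementation on its own returns the same value in two runs.
    return cls().foo() == cls().foo()

each_impl_ok = impl_result_is_low(A) and impl_result_is_low(B)
client_result_is_low = client(True) == client(False)
print(each_impl_ok, client_result_is_low)
```

Both implementations satisfy their contract individually, yet the result observed by the client differs between two executions with different secrets, so the result is not low.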

To avoid this unsoundness while retaining the ability to use relational
specifications¹, the problematic encoding must be replaced, either with an operational
one, or with a different non-operational encoding that is sound for relational spec- 
ifications. The former option is not applicable here: An operational encoding for 
dynamically-bound calls would essentially have to case split on the dynamic type 
of the receiver and invoke the appropriate override. Since such an encoding is 
inherently non-modular (all possible overrides need to be known), we follow the 
alternative option: we give an example of a non-operational, but sound encoding. 

For our new encoding we exploit the fact that the standard encoding is 
unsound only if the two executions of the program resolve the dynamically- 
bound call to two different implementations, that is, if the dynamic types of the 
receiver differ in the two executions. We reflect this observation by adjusting 
the encoding of pre- and postconditions as follows: (1) If the postcondition of 
a method guarantees that an expression is low, we assume this at the call site 
only if the dynamic type of the receiver is also low, that is, the calls in the two 
program executions are resolved to the same implementation. (2) Similarly, if a 
precondition requires that the call is a low event, we enforce that the receiver 
type is low in addition to the usual criterion for low events. Low events typically 
perform observable behavior such as I/O; it is therefore important that the same 
observable behavior is produced, independent of the receiver type. The meaning 
of low-assertions in preconditions remains unchanged, because the requirement of 


1 One could, of course, forbid the use of relational specifications in some places to 
trivially avoid the unsoundness; this, however, is typically not useful in practice. 
a method to receive low arguments is independent of the invoked implementation
and must, thus, not be weakened. lowEvent-assertions are generally not allowed 
in postconditions, where they add no expressiveness. 

We encode this adjustment as follows: 


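$\llbracket low(e) \rrbracket^{\mathring{p}}_{post,r} = (p1 \wedge p2 \wedge type(r^{(1)}) = type(r^{(2)})) \Rightarrow e^{(1)} = e^{(2)}$
$\llbracket lowEvent \rrbracket^{\mathring{p}}_{pre,r} = (p1 = p2) \wedge type(r^{(1)}) = type(r^{(2)})$

where type(e) represents the dynamic type of expression e, $\llbracket P \rrbracket^{\mathring{p}}_{post,r}$ is the encoding
of P in the postcondition of a call with receiver r, and $\llbracket P \rrbracket^{\mathring{p}}_{pre,r}$ represents the
same for the precondition. We leave the remaining encoding untouched, meaning
that we can summarize the resulting encoding as follows: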


1. We keep the existing check for behavioral subtyping for all overrides; this 
prevents, for example, that A.foo is overridden with a method that simply 
returns a secret value and therefore leaks information into the result. 

2. We keep the existing encoding of dynamically-bound calls as an assert
followed by an assume, but interpret low(e) in postconditions and lowEvent in
preconditions as shown above.


In the example above, this encoding lets the caller assume that the result is 
low only if it can prove that the dynamic type of a is low. 
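A small Python sketch of the adjusted encodings shows why the caller can no longer draw the wrong conclusion. The classes mirror Fig. 4; `enc_low_post` and `enc_low_event_pre` are hypothetical names for the adjusted encodings of low(e) in a postcondition and lowEvent in a precondition, and Python's `type` stands in for the dynamic type function.

```python
class A:
    def foo(self) -> int:
        return 0

class B(A):
    def foo(self) -> int:
        return 1

def enc_low_post(p1, p2, r1, r2, e1, e2):
    # Adjusted low(e) in a postcondition: equality of e is only claimed if
    # both executions are active and the receiver types coincide.
    return (not (p1 and p2 and type(r1) is type(r2))) or e1 == e2

def enc_low_event_pre(p1, p2, r1, r2):
    # Adjusted lowEvent in a precondition: the usual criterion plus the
    # requirement that the receiver type is low.
    return p1 == p2 and type(r1) is type(r2)

receivers = [A(), B()]
# The adjusted postcondition is true for every pair of actual executions,
# so assuming it at the call site does not introduce a false fact.
adjusted_holds = all(
    enc_low_post(True, True, r1, r2, r1.foo(), r2.foo())
    for r1 in receivers for r2 in receivers)
# The unadjusted postcondition low(result) is false for mixed receiver types.
unadjusted_holds = all(
    r1.foo() == r2.foo() for r1 in receivers for r2 in receivers)
print(adjusted_holds, unadjusted_holds)
```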
The adjusted encoding is indeed sound: 


Theorem 2. Let $S_c$ be of the form x := r.m(), where r has static type A, and
let $pre_{A.m}$ and $post_{A.m}$ be the pre- and postcondition of A.m. Assume that the
implementation of A.m and its overrides fulfill their specifications and satisfy
behavioral subtyping. Then the described encoding of $S_c$ is relationally sound.


Note that this encoding is incomplete, since it is not aware that two different 
receiver types can lead to the same implementation being called (e.g., if one 
type inherits from the second and does not override the called method). Alter- 
native encodings could explicitly represent this possibility. Conversely, one could 
approximate further (while remaining sound) by requiring the receiver values to 
be low, not just their types, in encodings that do not model dynamic types. 


4 Product Programs and Concurrency 


Automated verification of information flow security for concurrent programs is 
challenging because one needs to reason about a pair of executions that may have 
different thread interleavings. In fact, we are aware of only one tool that currently 
allows this, SecC, which automates SecCSL, a concurrent separation logic for 
information flow security proofs [19]. A product construction applied directly 
to concurrent programs would have to faithfully represent all combinations of 
potential thread interleavings, which makes verification infeasible. Consequently, 
to the best of our knowledge, no such product construction exists. 
For trace properties, many existing verifiers avoid reasoning about all possible
thread interleavings by employing a program logic (such as concurrent separation
logic [38]) that essentially reduces verification to sequential reasoning and allows
concurrent verification problems to be encoded into sequential IVLs. Examples
of such verifiers include Vercors and Nagini (using the Viper IVL), as well as
Chalice [32], VCC, and Spec# (using the Boogie IVL). 

In this section, we show how to use IVL-level product programs to extend 
such verifiers to handle information flow. We first describe how existing IVL 
encodings for concurrent languages work, and subsequently show how we can 
use similar principles to apply an IVL-based product construction, and which 
additional proof obligations we must fulfill to ensure that no flows exist as a 
result of concurrency. We will do this for two different notions of information 
flow security for concurrent programs, possibilistic and probabilistic noninterfer- 
ence; however, the principles behind the approach may also extend to alternative 
notions of information flow security such as observational determinism [49]. 

Our goal is to describe a technique that applies to a wide range of source 
languages, IVLs, proof techniques, and encodings. Therefore, we focus on the 
high-level concepts, instead of formalizing them for one specific setting. 


4.1 Concurrent IVL Encodings 


Since all IVLs we are aware of are sequential languages, encodings from con- 
current source languages to IVLs do not model the exact behavior of the origi- 
nal language, in particular, the aforementioned thread interleavings (i.e., these 
encodings are non-operational). Instead, they encode a verification condition 
that ensures that the original program is correct for every possible thread inter- 
leaving. 

While the exact proof techniques differ between frontends, and can be based 
for example on Concurrent Separation Logic (CSL) [38] or ownership [13, 24, 26], 
they generally follow a common pattern [31]: They prove that the source pro- 
gram is data race free, which ensures that thread interactions need to be con- 
sidered only at well-defined synchronization points, for instance, upon acquiring 
or releasing a lock. The code between such interaction points can be considered 
to execute without interference from other threads, and thus can be reasoned 
about as if it were sequential. 

We focus on locks here, but other synchronization primitives are handled 
analogously. Program logics based on CSL or ownership systems formally connect 
a lock and the heap locations it protects, such that these locations may be 
accessed only while holding the respective lock. In addition, they associate locks 
with an invariant that constrains the values of the heap locations it protects. 
When acquiring a lock, a thread may assume that this lock invariant holds, 
and when releasing a lock, it has to prove that the invariant is reestablished. A 
frontend can encode this into an IVL as depicted in Fig. 5. 
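
As a rough illustration of this pattern, the following Python sketch mimics the lock encoding at run time. It is only an analogue under stated assumptions: a static verifier assumes the invariant on acquire, whereas running code can merely check it. The names InvariantLock and Cell are hypothetical helpers introduced for this sketch and are not part of any verifier discussed here.

```python
import threading
from typing import Callable


class InvariantLock:
    """Lock paired with an invariant over the state it protects (hypothetical helper)."""

    def __init__(self, invariant: Callable[[], bool]) -> None:
        self._lock = threading.Lock()
        self._invariant = invariant

    def acquire(self) -> None:
        self._lock.acquire()
        # Static encoding: `assume Inv(l)`. At run time we can only check it.
        if not self._invariant():
            raise AssertionError("invariant does not hold on acquire")

    def release(self) -> None:
        # Static encoding: `assert Inv(l)` before access to the memory is lost.
        if not self._invariant():
            raise AssertionError("invariant not reestablished before release")
        self._lock.release()


class Cell:
    def __init__(self) -> None:
        self.val = 0


cell = Cell()
lock = InvariantLock(lambda: cell.val >= 0)  # Inv(l): the protected value is non-negative

lock.acquire()
cell.val = -1   # the invariant may be broken while the lock is held ...
cell.val = 7    # ... as long as it is reestablished before the release
lock.release()
```

The invariant is checked only at the two synchronization points, which mirrors why the code in between can be treated as sequential.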

Our solution for information flow verification in concurrent programs follows 
the same basic approach: We exploit that code between lock operations can 
be considered to execute without interference, and that we can therefore use 
enc(l.acquire()) = // gain access to protected memory;
                   assume Inv(l)

enc(l.release()) = assert Inv(l);
                   // lose access to protected memory


Fig. 5. Standard IVL encoding of lock operations. Inv(l) denotes the invariant
constraining the memory protected by lock l.


ordinary sequential product programs to reason about this code. To capture 
the thread interactions at synchronization points, we extend lock invariants to 
contain relational assertions (which can prescribe that some values protected by 
the lock are low), and add additional checks around lock operations to ensure 
that they do not give rise to unwanted information flow. 


4.2 Possibilistic Noninterference 


For concurrent programs, standard noninterference is too strict a prop- 
erty because concurrent programs are usually non-deterministic. One way of 
approaching this problem is to instead verify possibilistic noninterference, which 
enforces that high information does not influence the possible values of low out- 
puts, i.e., if some combination of low output values is reachable from an initial 
state, then the same combination of low output values must still be reachable 
using some possible thread schedule after arbitrarily changing the high inputs. 
Possibilistic noninterference can be defined as follows: 


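Definition 4. A program s with a set of input variables I and output variables
O, of which some subsets $I_l \subseteq I$ and $O_l \subseteq O$ are low, satisfies possibilistic
noninterference iff for all $\sigma_1, \sigma_2$ and $\sigma_1'$, if $\forall x \in I_l.\ \sigma_1(x) = \sigma_2(x)$ and
$(s, \sigma_1) \rightarrow^* (\mathit{skip}, \sigma_1')$ then $(s, \sigma_2) \rightarrow^* (\mathit{skip}, \sigma_2')$ for some $\sigma_2'$ s.t.
$\forall x \in O_l.\ \sigma_1'(x) = \sigma_2'(x)$.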


Note that this property allows high inputs to influence the probability of 
different outputs and may therefore not be desirable in all scenarios; we discuss 
a stronger notion of noninterference in the next subsection. 
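
To make Definition 4 concrete, the following sketch checks the property by brute force for programs whose behavior is given explicitly as a function from a low and a high input to the set of reachable low outputs. This finite enumeration is an assumption made purely for illustration; the function names and the two toy programs are hypothetical and do not correspond to the verification technique of this paper.

```python
from itertools import product
from typing import Callable, FrozenSet, Iterable


def possibilistic_ni(
    outcomes: Callable[[int, int], FrozenSet[int]],
    lows: Iterable[int],
    highs: Iterable[int],
) -> bool:
    """True iff the set of reachable low outputs never depends on the high input."""
    highs = list(highs)
    for low in lows:
        for h1, h2 in product(highs, repeat=2):
            # Definition 4: every low output reachable with h1 is reachable with h2.
            if not outcomes(low, h1) <= outcomes(low, h2):
                return False
    return True


def racy_writes(low: int, high: int) -> FrozenSet[int]:
    # Two threads write low and low + 1; either write may come last.
    return frozenset({low, low + 1})


def leaky(low: int, high: int) -> FrozenSet[int]:
    # The output low + 1 is reachable only if the secret is positive.
    return frozenset({low, low + 1}) if high > 0 else frozenset({low})


print(possibilistic_ni(racy_writes, range(3), range(3)))  # True
print(possibilistic_ni(leaky, range(3), range(3)))        # False
```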

Since we build on a proof technique that ensures data race freedom, we can 
see each program trace as a sequence of local operations and lock operations by 
specific threads, where (1) every local operation depends only on previous (local 
or lock) operations of the same thread, and (2) every lock operation depends 
only on the previous local operations of the same thread and all previous lock 
operations (of arbitrary threads). As a result, we can (akin to partial order 
reduction) rearrange segments freely as long as we retain the overall order of lock 
operations and the order of operations of every specific thread; in particular, we 
can rearrange a trace so that it consists of a number of segments, such that in 
each segment, one thread executes any number of local operations and then one 
lock operation. 
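
The rearrangement argument can be sketched as a small function over explicit traces. The representation of operations as (thread, kind, label) triples is an assumption made for this illustration; the function buffers the local operations of each thread and emits them directly before that thread's next lock operation.

```python
from typing import Dict, List, NamedTuple


class Op(NamedTuple):
    thread: int
    kind: str   # "local" or "lock"
    label: str


def segment(trace: List[Op]) -> List[List[Op]]:
    """Rearrange a trace into segments: local ops of one thread, then one lock op."""
    pending: Dict[int, List[Op]] = {}   # thread -> local ops not yet emitted
    segments: List[List[Op]] = []
    for op in trace:
        if op.kind == "local":
            pending.setdefault(op.thread, []).append(op)
        else:
            # Emit the thread's buffered local ops right before its lock op.
            segments.append(pending.pop(op.thread, []) + [op])
    # Local ops after a thread's last lock op form trailing segments.
    for thread in sorted(pending):
        segments.append(pending[thread])
    return segments


trace = [
    Op(1, "local", "a1"), Op(2, "local", "b1"), Op(1, "local", "a2"),
    Op(2, "lock", "acquire"), Op(1, "lock", "acquire"), Op(2, "local", "b2"),
]
segments = segment(trace)
for seg in segments:
    print([f"t{op.thread}:{op.label}" for op in seg])
```

The result keeps the global order of lock operations and the order of operations within each thread, which are exactly the two orders that the dependencies (1) and (2) require.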
poss(l.acquire()) = assert lowEvent;
                    assert low(l);
                    assume Inv(l)
poss(l.release()) = assert lowEvent;
                    assert low(l);
                    assert Inv(l)
poss(while (e) do {s}) = assert low(e);
                         while (e) do {poss(s); assert low(e)}


Fig. 6. Statement encoding for possibilistic information flow security. For loops, we 
check that the loop guard is low, ensuring that termination is also low. 


Based on this observation, we impose proof obligations that ensure the fol- 
lowing property: For every program trace with some schedule and some low and 
high inputs, and for arbitrary different high inputs, there exists a second trace 
such that: (1) Both traces include the same lock operations performed by the 
same threads, in the same order, and (2) at each lock operation, the lock’s invari- 
ant holds; in particular, the relational assertions of the lock invariant correctly 
relate the state protected by the lock in both traces. 

To enforce this property, we devise four proof obligations that can be checked 
thread-locally: 


1. Every lock operation o is a low event, i.e., if a thread executes o in the first
execution, it will also execute o in the second execution.

2. Termination of the local code before the lock operation does not depend on
secret data; i.e., if lock operation o is reached in the first trace, it will also be
reached in the second trace.

3. o operates on the same lock in both executions, i.e., the lock is low.

4. If o releases the lock, i.e., makes a new lock state public, this lock state fulfills
the relational invariant, meaning that heap operations meant to be low are
identical in both executions after the lock operation.


Note that, even though the lock operations of both traces are closely aligned, 
their local operations may differ. For instance, a thread may branch on a high 
guard as long as no lock operation is performed before the control flow re-joins. 

The above checks are sufficient to satisfy Definition 4. The proof goes by 
induction on the number of segments of the traces and leverages the soundness 
of sequential verification within each segment. 


Encoding. The aforementioned checks can simply be checked as part of the 
encoding of lock operations. We adjust the encoding from Fig. 5 for possibilistic 
noninterference as shown in Fig. 6. For thread acquire and release, the assertions 
of lowEvent and low(l) directly ensure properties (1) and (3). Assuming and 
asserting the lock invariant works as in the standard IVL encoding for concurrent 
programs, but now this invariant can be relational, ensuring property (4). The 
condition on while loops is used to ensure property (2), which can be done simply 
def main(secret: bool) -> None:
    c = Cell()
    l = CellLock(c)
    l.acquire()
    c.val = 4
    if secret:
        l.release()
        l.acquire()
    c.val = 5
    l.release()


Fig. 7. Possibilistic information flow violation via a secret-dependent lock release. The 
Cell state 4 is visible to other threads only if secret is True. 


by asserting that the loop condition is low for every loop in the program (we 
assume, for simplicity, that there is no infinite recursion). 


Discussion. With our verification technique, the product construction on the 
IVL level does not need to be aware of concurrency in any way; applying the 
standard sequential product construction to the updated encoding is sufficient 
to ensure possibilistic noninterference in concurrent programs. 

To the best of our knowledge, we are the first to consider possibilistic infor- 
mation flow in a setting with locks, and therefore the first to propose that the 
order of lock operations must be constrained. The example in Fig. 7 demonstrates 
that this requirement is indeed necessary to prevent unwanted information flow: 
The CellLock protects the val field of a Cell object, which is intended to be low. 
The code unconditionally sets the field to two constants (first to 4, then to 5), 
which should be allowed since the constants are low. However, whether the lock 
is released while the cell has value 4 depends on a secret. As a result, when a 
different thread acquires the lock and sees that the value is 4, this leaks that the 
secret must have been true. 
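
The leak can be made explicit by listing the cell values that become visible to other threads at each release. The function below is a hand-written model of the program in Fig. 7 under the assumption that other threads observe the cell only between a release and the next acquire; the name published_values is hypothetical.

```python
from typing import List


def published_values(secret: bool) -> List[int]:
    """Cell values that other threads can observe at each lock release."""
    visible: List[int] = []
    val = 4                    # c.val = 4, written while holding the lock
    if secret:
        visible.append(val)    # l.release() publishes the intermediate value 4
    val = 5                    # c.val = 5
    visible.append(val)        # the final l.release() publishes 5
    return visible


print(published_values(True))   # [4, 5]
print(published_values(False))  # [5]
```

Both written constants are low, yet the set of published states differs between the two runs, which is why the release itself must be a low event.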

Another example that illustrates the requirement to ensure that high data 
does not influence which lock a lock operation accesses can be found in Fig. 8. 
Here, two locks are created, and thread 1 acquires the first one. Thread 2 
acquires, depending on the secret, either the same lock or a different one. This 
influences the possible results of the program: If both threads acquire the same 
lock, then the print statements of one thread cannot be interleaved with those 
of the other, otherwise they can. As a result, if the attacker observes the pattern 
1212 (or any other interleaving of 1s and 2s), they know with certainty that the 
two threads acquired different locks and secret must therefore be False. 
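
The difference in possible outputs can be enumerated directly. The sketch below is a simplified model of Fig. 8 and not the program itself: it assumes that a critical section guarded by a shared lock behaves as one atomic block, and that with different locks every interleaving of the four print statements is possible.

```python
from itertools import permutations
from typing import Set


def possible_outputs(same_lock: bool) -> Set[str]:
    """Print patterns an attacker may observe for the two threads of Fig. 8."""
    if same_lock:
        # Mutual exclusion: each thread prints both of its digits atomically.
        blocks = ["11", "22"]
    else:
        # Different locks: the four prints may interleave arbitrarily.
        blocks = list("1122")
    return {"".join(order) for order in permutations(blocks)}


print(sorted(possible_outputs(same_lock=True)))   # ['1122', '2211']
print(sorted(possible_outputs(same_lock=False)))
```

Since secret being true corresponds to the same-lock case, observing a pattern such as 1212 rules out that case.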

The necessity to prevent termination differences in a concurrent setting has 
been recognized before in work on security type systems [45]. 


Product Programs in the Wild 733 


def thread1(l: Lock) -> None:
    # requires lowEvent
    l.acquire()
    print(1)
    print(1)
    l.release()

def thread2(l: Lock) -> None:
    # requires lowEvent
    l.acquire()
    print(2)
    print(2)
    l.release()

def main(secret: bool) -> None:
    l1 = Lock()
    l2 = Lock()
    if secret:
        l = l1
    else:
        l = l2
    fork thread1(l1)
    fork thread2(l)


Fig. 8. Possibilistic information flow violation through locks. If secret is true, both 
threads acquire the same lock, and their critical sections cannot be interleaved. 


def thread1(l: Lock, c: Cell):
    ctr = 0
    for i in range(100):
        ctr += 1
    l.acquire()
    c.val = 1
    l.release()

def thread2(l: Lock, c: Cell, secret: int):
    ctr = 0
    for i in range(secret):
        ctr += 1
    l.acquire()
    c.val = 2
    l.release()


Fig. 9. Example of probabilistic information flow. With a non-deterministic scheduler, 
secret does not influence the set of possible outputs, but a greater secret leads to 
higher probability of seeing a final cell value of 2. 


4.3 Probabilistic Noninterference 


Possibilistic noninterference is too imprecise for many applications. Figure 9 illus- 
trates the problem: The final value of c.val can be either 1 or 2, that is, possi- 
bilistic noninterference holds. However, with most schedulers, a final value of 2 
is much more likely for greater secret values than for lower values because the 
assignment of 1 is more likely to happen before the assignment of 2. 
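
The effect can be quantified for one particular scheduler. The sketch below assumes a scheduler that picks uniformly at random among the runnable threads, and it counts each loop iteration and each locked write as a single atomic step; both are modeling assumptions for illustration and are not taken from the paper. Under this model, the final value is 2 exactly when thread 1 finishes first.

```python
from fractions import Fraction


def prob_final_value_is_2(secret: int, thread1_iters: int = 100) -> Fraction:
    """Probability that thread 2 writes last under a uniformly random scheduler."""
    a = thread1_iters + 1   # steps of thread 1: loop iterations plus the write
    b = secret + 1          # steps of thread 2: loop iterations plus the write
    # p[i][j]: probability that thread 1 finishes first with i and j steps left.
    p = [[Fraction(0)] * (b + 1) for _ in range(a + 1)]
    for j in range(1, b + 1):
        p[0][j] = Fraction(1)   # thread 1 is done while thread 2 still runs
    for i in range(1, a + 1):
        for j in range(1, b + 1):
            # The scheduler picks either thread with probability 1/2.
            p[i][j] = (p[i - 1][j] + p[i][j - 1]) / 2
    return p[a][b]


for secret in (10, 100, 400):
    print(secret, float(prob_final_value_is_2(secret)))
```

For every secret the probability lies strictly between 0 and 1, so both final values remain possible, but it grows with the secret.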

A stronger notion of noninterference that forbids such leaks is probabilistic 
noninterference, which requires that two executions from low-equivalent initial 
states will produce the same low outputs with the same probabilities. 


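Definition 5. A program s with a set of input variables I and output variables
O, of which some subsets $I_l \subseteq I$ and $O_l \subseteq O$ are low, satisfies probabilistic
noninterference iff for all $\sigma_1, \sigma_2$ and $\sigma_1'$, if $\forall x \in I_l.\ \sigma_1(x) = \sigma_2(x)$ and
$(s, \sigma_1) \rightarrow^* (\mathit{skip}, \sigma_1')$ with probability p then $(s, \sigma_2) \rightarrow^* (\mathit{skip}, \sigma_2')$ with probability p for
some $\sigma_2'$ s.t. $\forall x \in O_l.\ \sigma_1'(x) = \sigma_2'(x)$.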


The information flow in Fig. 9 is caused by secret data influencing the tim- 
ing of thread 2, which in turn may affect the relative order of modifications of 
shared variables. To prevent secrets from influencing the timing of operations, we 
additionally assert that every branch condition in the program is low, meaning 
that the two executions will always follow the same code path, which leads to 
the adjusted encoding in Fig. 10. Note that the check that branch conditions are 
low must also be performed for any implicit branches; e.g., with the encoding of 
prob(l.acquire()) = assert low(l);
                    assume Inv(l)
prob(l.release()) = assert low(l);
                    assert Inv(l)
prob(while (e) do {s}) = assert low(e);
                         while (e) do {prob(s); assert low(e)}
prob(if (e) then {s1} else {s2}) = assert low(e);
                                   if (e) then {prob(s1)} else {prob(s2)}
prob(r.m()) = assert low(type(r));
              r.m()


Fig. 10. Statement encoding for probabilistic information flow security. 


dynamically-bound calls shown before, we must now assert that the type of the 
receiver of every such call is low. Also note that since we enforce that branches 
are low, the lowEvent conditions we showed in the possibilistic encoding will 
be trivially fulfilled and can be omitted here. However, we still need to assert 
that acquired and released lock references are low. This last requirement has not 
been discussed in previous work (whereas forbidding high branches is standard 
practice in type systems and program logics [36]). 

With this adjusted encoding, probabilistic noninterference can be verified 
using simple assertions in the IVL encoding and subsequently performing a stan- 
dard product construction on the IVL level. So, in summary, this approach lets 
us extend existing verifiers for concurrent programs to verify both possibilistic 
and probabilistic noninterference with very small changes in the frontend, and 
without requiring any changes on the level of the IVL (except the ability to write 
relational specifications) and the product construction. 
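
To indicate how small the frontend change is, the following sketch implements both statement encodings as one recursive function over a toy statement AST. The AST classes, the textual output format, and the mode parameter are assumptions made for this illustration; they do not reflect the actual implementation in any verifier, and dynamically-bound calls are omitted.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Acquire:
    lock: str


@dataclass
class Release:
    lock: str


@dataclass
class Assign:
    text: str


@dataclass
class While:
    cond: str
    body: List["Stmt"]


@dataclass
class If:
    cond: str
    then: List["Stmt"]
    orelse: List["Stmt"]


Stmt = Union[Acquire, Release, Assign, While, If]


def encode(stmts: List[Stmt], mode: str) -> List[str]:
    """Emit IVL-like lines; mode is "poss" (possibilistic) or "prob" (probabilistic)."""
    out: List[str] = []
    for s in stmts:
        if isinstance(s, (Acquire, Release)):
            if mode == "poss":
                out.append("assert lowEvent")       # only needed possibilistically
            out.append(f"assert low({s.lock})")     # the lock reference is low
            verb = "assume" if isinstance(s, Acquire) else "assert"
            out.append(f"{verb} Inv({s.lock})")
        elif isinstance(s, While):
            out.append(f"assert low({s.cond})")
            out.append(f"while ({s.cond}) {{")
            out += ["  " + line for line in encode(s.body, mode)]
            out.append(f"  assert low({s.cond})")
            out.append("}")
        elif isinstance(s, If):
            if mode == "prob":
                out.append(f"assert low({s.cond})")  # no high branches at all
            out.append(f"if ({s.cond}) {{")
            out += ["  " + line for line in encode(s.then, mode)]
            out.append("} else {")
            out += ["  " + line for line in encode(s.orelse, mode)]
            out.append("}")
        else:
            out.append(s.text)
    return out


# The program of Fig. 7, written in the toy AST.
fig7 = [
    Acquire("l"), Assign("c.val := 4"),
    If("secret", [Release("l"), Acquire("l")], []),
    Assign("c.val := 5"), Release("l"),
]
print("\n".join(encode(fig7, "prob")))
```

In the possibilistic mode, the lock operations under the high guard carry a lowEvent assertion that cannot be proved; in the probabilistic mode, the guard itself is asserted to be low.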


5 Implementation and Evaluation 


In this section, we evaluate the performance of the proposed architecture, by 
extending the previously information flow unaware Nagini verifier for Python [17] 
according to our design. We will first briefly describe Nagini and the adaptations 
we needed to make, then evaluate the performance overhead generated by the 
product transformation, and subsequently evaluate the implementation on a 
number of information flow examples, comparing it to SecC [19] in the process. 


5.1 Nagini 


Nagini is an automated verifier for statically-typed Python 3 programs. It sup- 
ports a large subset of the Python language, comprising features like exception 
handling, polymorphism, dynamic field creation, and concurrency. Reasoning 
about some of these features is quite intricate even without the overhead of a 
product construction, so we believe that Nagini is a good target to evaluate the 
performance of the proposed architecture for verifiers for complex languages. 
Nagini encodes Python programs and their specifications into the Viper IVL
[35], and then uses Viper’s backend verifiers to automatically verify those
programs using the Z3 SMT solver [34]. For concurrent programs, Nagini uses
an encoding similar to the one described in Sect. 4, using implicit dynamic
frames [44] (a flavor of separation logic [38,40]) to prove data race freedom; as a
result, we could modify its existing encoding as shown in Sect. 4 to prove both
possibilistic and probabilistic noninterference for concurrent programs. Nagini’s
existing encoding from Python to Viper is almost entirely operational, so we only
adapted the encoding of dynamically-bound calls as shown in Sect. 3.5.

We extended Nagini’s existing specification language to include information 
flow specifications and implemented the modular product program transforma- 
tion for 2-hyperproperties for the existing Viper AST (enriched, again, with new 
AST nodes for information flow specifications). For convenience, we also slightly 
extended the Viper-based product transformation to directly transform state- 
ments that Nagini previously encoded using gotos, such as break and continue 
statements. The Viper extension for product programs² and the extended version
of Nagini³ are open source and available online.


5.2 Performance Overhead of the Product Construction 


Our first goal is to evaluate the performance overhead generated by the product 
construction. We compared the verification times of Nagini’s entire functional 
test suite with and without the product transformation enabled. The test cases 
range from small programs targeting specific language or specification constructs, 
to realistic code examples taken from programming tutorials. We ran each test 
five times on a warmed up JVM with the information flow extension enabled and 
disabled, without adding any information flow specifications. Our test system 
was a 12 core AMD Ryzen 3900X with 32 GB of RAM running Ubuntu 20.04.1. 

All tests report the same results with and without the product transformation,
meaning that completeness is not impacted by the extension, and that we can
indeed still reason about the entire language subset supported by Nagini. Without
the product transformation, each test case takes between 3 and 9 s, with the
majority taking between 3 and 5 s. For most cases, enabling the product construction
leads to an increase in verification time that is clearly acceptable (less than
11% for half the tests, less than 30% for three quarters, and less than 100% for 90%
of the tests). For five test cases, the slowdown is a factor between 5 and 12, and a
single outlier (a quicksort implementation) has a slowdown factor of 17.5 and a
resulting verification time of two minutes. We believe that the main reason for the
large slowdown for these particular test cases is the use of quantifiers in their
specifications (e.g., to specify properties of all elements in a list). Quantifier handling is
difficult for automated verification in general, because unbounded chains of
quantifier instantiations can occur during the proof search [15], and this problem seems
to be exacerbated when using the product encoding.


2 https://github.com/viperproject/silver-sif-extension.
3 https://github.com/marcoeilers/nagini.
Table 1. Programs evaluated for proving information flow security. We show the total
lines of code (LOC) including implementation and specification but excluding whitespace,
lines of specification and proof annotation (Ann.), the property we proved (Prop.,
where NI = noninterference, TNI = termination sensitive noninterference, PS =
possibilistic noninterference) and the verification time in seconds (T), averaged over five
runs.


Program        | LOC | Ann. | Prop. | T
banerjee       | 77  | 21   | NI    | 5.19
constanzo      | 21  | 12   | NI    | 5.39
darvas         | 38  | 18   | NI    | 4.20
example        | 27  | 12   | NI    | 5.39
Example-decl   | 27  | 12   | NI    | 5.76
Example-term   | 8   | 4    | TNI   | 3.59
joana-1-tl     | 22  | 7    | NI    | 3.87
joana-2-bl     | 13  | 5    | NI    | 3.64
joana-2-t      | 12  | 4    | NI    | 3.72
joana-3-bl     | 36  | 15   | TNI   | 3.55
joana-3-br     | 33  | 14   | TNI   | 4.60
joana-3-tl     | 23  | 9    | TNI   | 4.50
joana-3-tr     | 25  | 10   | TNI   | 4.19
joana-13-l     | 1   | 2    | NI    | 4.54
kusters        | 28  | 12   | NI    | 4.35
naumann        | 27  | 17   | NI    | 8.46
product        | 39  | 18   | NI    | 11.35
smith          | 39  | 21   | NI    | 6.81
terauchi1      | 10  | 3    | NI    | 3.59
terauchi3      | 19  | 6    | NI    | 3.69
terauchi4      | 18  | 8    | NI    | 3.97
Fig. 4         | 19  | 5    | NI    | 3.82
loop leak [45] | 53  | 17   | PS    | 4.92
high loop      | 24  | 11   | PS    | 4.19
Fig. 7         | 23  | 8    | PS    | 4.37
Fig. 8         | 36  | 15   | PS    | 4.40
Fig. 9         | 34  | 15   | PS    | L57


We conclude that the performance impact of the product transformation is 
acceptable for most examples, but can be significant for programs with complex 
functional specifications. 


5.3 Expressiveness and Comparison with SecC 


In a second step, we evaluated the expressiveness and performance of our imple- 
mentation on a number of challenging examples from the literature. In partic- 
ular, we use the examples from the original paper about modular product pro- 
grams [18] (sequential examples collected from various previous papers, trans- 
lated to Python) and from this paper, both shown in Table 1, as well as examples 
taken from SecC [19], the only other automated verification tool for concurrent 
programs we are aware of, shown in Table 2. The latter table includes the CDDC 
case study [36], which models an embedded device that interacts simultaneously 
with multiple users and classified networks. Our examples represent the state 
of the art in automated information flow verification, requiring semantic rea- 
soning that would not be possible in a type system, and using complex infor- 
mation flow specifications including declassification, termination-sensitive non- 
interference, and value-dependent sensitivity [36]. As mentioned before, these 
features can be easily encoded into modular product programs using existing 
techniques [18]. 

Nagini was able to verify all examples, which demonstrates that our approach 
can handle concurrent implementations and express complex noninterference 
properties. For the examples from Table 1, Nagini takes only between 3 and 12 s
each. As for the tests from SecC, Nagini takes around five seconds for three
Table 2. Comparison with SecC. We show the total lines of code and lines of
specification for Nagini (LOC_N, Ann_N) and SecC (LOC_S, Ann_S), the property we proved
(Prop., where NI = noninterference, PR = probabilistic noninterference) and the
verification time in seconds in both tools (T_N and T_S) and in Nagini without the product
construction (T_NP), averaged over five runs.


Example      | LOC_N | Ann_N | LOC_S | Ann_S | Prop. | T_S   | T_N    | T_NP
SecC CAV     | 40    | 13    | 50    | 11    | PR    | 1.33  | 4.21   | 3.56
SecC CDDC    | 278   | 105   | 214   | 47    | PR    | 21.20 | 52.20  | 8.60
SecC CT      | 64    | 35    | 211   | 159   | PR    | 1.87  | 5.41   | 3.97
SecC DB      | 100   | 48    | 256   | 167   | NI    | 2.75  | 182.60 | 6.23
SecC Encrypt | 29    | 12    | 49    | 18    | NI    | 1.45  | 4.76   | 3.66

of them, 52 s for the CDDC case study, and 183 s for an example involving a
large number of quantifiers. We believe that 52 s for a complex case study is
still acceptable, whereas the slowest example demonstrates that extensive use of
quantifiers will lead to problematic performance in practice.

Table 2 shows that SecC is much faster than our implementation. However,
SecC was designed and implemented for information flow verification from
scratch, without being able to reuse code from an existing verifier, whereas 
our extended Nagini implementation could be implemented with minimal effort. 
Besides this crucial difference, Nagini and SecC differ in many other ways, e.g., in 
their supported language features, automation (see the table for required anno- 
tations), and specification styles. As a result, direct performance comparisons 
between the two are difficult; in fact, the unmodified version of Nagini without 
the product construction already takes more time than SecC on four out of five 
examples, likely as a result of the overhead required for modeling more complex 
language features. 
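
The two effects can be separated with a short calculation on the timings of Table 2. The sketch below copies the reported numbers and computes, per example, the slowdown caused by the product construction (T_N divided by T_NP) and whether unmodified Nagini is already slower than SecC; the variable names are chosen for this illustration only.

```python
# (example, T_S, T_N, T_NP) in seconds, copied from Table 2.
rows = [
    ("SecC CAV",     1.33,   4.21, 3.56),
    ("SecC CDDC",   21.20,  52.20, 8.60),
    ("SecC CT",      1.87,   5.41, 3.97),
    ("SecC DB",      2.75, 182.60, 6.23),
    ("SecC Encrypt", 1.45,   4.76, 3.66),
]

for name, t_s, t_n, t_np in rows:
    overhead = t_n / t_np   # slowdown attributable to the product construction
    print(f"{name:13s} overhead x{overhead:6.2f}  baseline slower than SecC: {t_np > t_s}")

# Number of examples on which Nagini without the product is slower than SecC.
slower_baseline = sum(t_np > t_s for _, t_s, _, t_np in rows)
print(slower_baseline)  # 4
```

Three of the five examples show a product overhead below a factor of 1.4, while the quantifier-heavy DB example accounts for the largest slowdown.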


6 Related Work 


There are various existing type systems (e.g., [37,45]) and static analyses 
(e.g., [11,22]) for proving information flow security. Compared to verification 
based on product programs, these are more automated, but less precise. More- 
over, there are dedicated program logics for information flow verification, such 
as SecCSL [19], Covern [36], and Veronica [42], all of which allow proving prob- 
abilistic noninterference for concurrent programs based on different reasoning 
techniques. The implementation of the former in SecC is the only existing tool 
that automates information flow verification for concurrent programs, see Sect. 5. 

Relational logics, such as Relational Hoare Logic [7] and Cartesian Hoare 
Logic [46], allow proving general relational program properties, which includes 
noninterference. However, while they tackle a more general problem, they gen- 
erally work only for sequential programs. Some tools automate information flow 
verification using self-composition, e.g., for C [9] and for Java [41]. Compared to 
modular product programs, this approach generally does not allow for modular
proofs of information flow security [18, 48]. 

Modular product programs were presented by Eilers et al. [18]. Other forms of 
product programs differ in the way executions are interleaved. While some keep 
executions in lock step [4], like modular product programs, others do not describe 
a deterministic product construction and allow for arbitrary interleavings [5]. In 
particular, Shemer et al. [43] propose property-directed self-composition, which 
dynamically determines how to compose and interleave different executions 
based on the property to be verified. Similarly, Strichman and Veitsman [47] 
propose a product-like construction that interleaves recursive functions whose 
executions are not in lock step. Recently, Pick et al. [39] showed how to auto- 
matically infer information flow specifications on modular product programs, 
which can likely be combined with the approach examined in this paper. 

To the best of our knowledge, SymDiff [27] for the Boogie IVL is the only 
existing tool that constructs product programs on an IVL-level. SymDiff is a 
tool for differential program verification, which requires reasoning about pairs of 
executions of two different (but related) programs and is thus similar to hyper- 
property verification; in fact, SymDiff has also been used to verify noninterfer- 
ence in the past [1]. The authors of SymDiff have proposed different techniques 
for modularly proving mutual function summaries, similar to relational specifi- 
cations, one of which uses a kind of product construction [25,28]. However, they 
do not examine potential soundness problems arising from this approach, nor do 
they discuss if it can be applied to concurrent source programs. 


7 Conclusion 


We presented an approach for retrofitting existing IVL-based program verifiers
to check information flow security using product programs. This approach allows 
reusing existing frontends to reduce the required implementation effort. We have 
shown when this technique is sound, that it can incorporate concurrency, and 
that it can be implemented in an existing verifier with acceptable performance. 
Abstract. In recent years there have been numerous works that aim to
automate relational verification. Meanwhile, although Constrained Horn 
Clauses (CHCs) empower a wide range of verification techniques and 
tools, they lack the ability to express hyperproperties beyond k-safety 
such as generalized non-interference and co-termination. 

This paper describes a novel and fully automated constraint-based 
approach to relational verification. We first introduce a new class of pred- 
icate Constraint Satisfaction Problems called pfwCSP where constraints 
are represented as clauses modulo first-order theories over predicate vari- 
ables of three kinds: ordinary, well-founded, or functional. This general- 
ization over CHCs permits arbitrary (i.e., possibly non-Horn) clauses, 
well-foundedness constraints, functionality constraints, and is capable of 
expressing these relational verification problems. Our approach enables 
us to express and automatically verify problem instances that require 
non-trivial (i.e., non-sequential and non-lock-step) self-composition by 
automatically inferring appropriate schedulers (or alignment) that dic- 
tate when and which program copies move. To solve problems in this 
new language, we present a constraint solving method for pfwCSP based 
on stratified CounterExample-Guided Inductive Synthesis (CEGIS) of 
ordinary, well-founded, and functional predicates. 

We have implemented the proposed framework and obtained promis- 
ing results on diverse relational verification problems that are beyond 
the scope of the previous verification frameworks. 


Keywords: Relational verification - Constraint solving - CEGIS 


1 Introduction 


We describe a novel constraint-based approach to automatically solving a wide 
range of relational verification problems including k-safety, co-termination [6, 
10], termination-sensitive non-interference (TS-NI) [63], and generalized non- 
interference (GNI) [40] for infinite-state programs. 
A key challenge in relational property verification is the discovery of relational
invariants which relate the states of multiple program executions. However,
whereas most prior approaches must fix the execution schedule¹ (e.g., lock-step
or sequential) [8,20,21,42,54,57], a recent work by Shemer et al. [50] has
proposed a method to automatically infer sufficient fair schedulers to prove the 
goal relational property. Importantly, the schedulers in their approach can be 
semantic in which the choice of which program to execute can depend on the 
states of the programs as opposed to the classic syntactic schedulers such as 
lock-step and sequential that can only depend on the control locations. How- 
ever, their approach requires the user to provide appropriate atomic predicates 
and is not fully automatic. Moreover, they only support k-safety properties. A 
recent work has proposed a method for automatically verifying non-hypersafety 
relational properties but only for finite state systems [19]. 

Meanwhile, today’s constraint-based frameworks are also insufficient at 
automating relational verification. The class of predicate constraints called Con- 
strained Horn Clauses (CHCs) [13] has been widely adopted as a “common inter- 
mediate language” for uniformly expressing verification problems for various pro- 
gramming paradigms, such as functional and object-oriented languages. Example 
uses of the CHCs framework include safety property verification [29,30,35] and 
refinement type inference [33,36,53,56,66]. The separation of constraint gener- 
ation and solving has facilitated the rapid development of constraint generation 
tools such as RCaml [56], SeaHorn [30], and JayHorn [35] as well as efficient
constraint solving tools such as Spacer [87], Eldarica [32], and HoIce [14].
Unfortunately, CHCs lack the ingredients to sufficiently express these relational 
verification problems. 

In this paper we introduce automated support for relational verification by 
generalizing CHCs and introducing a new class of predicate Constraint Sat- 
isfaction Problems called pfwCSP. This language allows constraints that are 
arbitrary (i.e., possibly non-Horn) clauses modulo first-order theories over pred- 
icate variables that can be functional predicates, well-founded predicates or ordi- 
nary predicates. We then show that, thanks to the enhanced predicate vari- 
ables, pfwCSP can express non-hypersafety relational properties such as co- 
termination [11], termination-sensitive non-interference (TS-NI) [63], and gen- 
eralized non-interference (GNI) [40]. In addition, our approach effectively quan- 
tifies over the schedule, expressing arbitrary fair semantic scheduling thanks to 
non-Horn clauses and functional predicates (functional predicates are needed to 
express fairness in the presence of non-termination which is needed for prop- 
erties like co-termination and TS-GNI). The flexibility allows our approach to 
automatically discover a fair semantic schedule and verify difficult relational 
problem instances that require non-trivial schedules. We prove that our encod- 
ings are sound and complete. Expressing relational invariants with such flexible 
scheduling is not possible with CHCs. However, pfwCSP retains a key benefit of 
CHCs: the idea of separating constraint generation from solving. 


1 The notion of schedule is also often called an alignment in literature. 


744 H. Unno et al. 


We next present a novel constraint solving method for pfwCSP based on strat- 
ified CounterExample-Guided Inductive Synthesis (CEGIS) of ordinary, well- 
founded, and functional predicates. In our method, ordinary predicates represent 
relational inductive invariants, well-founded predicates witness synchronous ter- 
mination, and functional predicates represent Skolem functions witnessing exis- 
tential quantifiers that encode angelic non-determinism. These witnesses for a 
relational property are often mutually dependent and involve many variables 
in a complicated way (see the extended report [58] for examples). The synthe- 
sis thus needs to use expressive templates without compromising the efficiency. 
Stratified CEGIS combines CEGIS [51] with stratified families of templates [55] 
(i.e., decomposing templates into a series of increasingly expressive templates) 
to achieve completeness in the sense of [34,55], a theoretical guarantee of conver- 
gence, and a faster and stable convergence by avoiding the overfitting problem 
of expressive templates to counterexamples [44]. The constraint solving method 
naturally generalizes a number of previous techniques developed for CHCs solv- 
ing and invariant/ranking function synthesis, addressing the challenges due to 
the generality of pfwCSP that is essential for relational verification. 
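
The interplay between counterexample-guided synthesis and a stratified template family can be illustrated on a toy problem. The following is only a minimal sketch under stated assumptions, not the algorithm of Sect. 5: the template family (predicates of the form a*x + b >= 0 with |a|, |b| bounded by the current stratum), the three example clauses, the doubling of the bound, and the bounded enumerative verifier standing in for an SMT solver are all hypothetical choices made for illustration.

```python
from itertools import product

DOMAIN = range(-20, 21)  # bounded verification domain (assumption of this sketch)


def make_pred(a, b):
    """Template instance: p(x) holds iff a*x + b >= 0."""
    return lambda x: a * x + b >= 0


def clauses_hold(p, x):
    """Toy clauses for the loop `x = 0; while (x < 10) x++;` with safety x <= 10."""
    init = (x != 0) or p(x)                       # x = 0        =>  p(x)
    step = not (p(x) and x < 10) or p(x + 1)      # p(x), x < 10 =>  p(x + 1)
    safe = (not p(x)) or x <= 10                  # p(x)         =>  x <= 10
    return init and step and safe


def synthesize(examples, bound):
    """Find coefficients in the current stratum consistent with all examples."""
    for a, b in product(range(-bound, bound + 1), repeat=2):
        p = make_pred(a, b)
        if all(clauses_hold(p, x) for x in examples):
            return (a, b)
    return None


def verify(a, b):
    """Return a counterexample point from DOMAIN, or None if all clauses hold."""
    p = make_pred(a, b)
    for x in DOMAIN:
        if not clauses_hold(p, x):
            return x
    return None


def stratified_cegis(max_bound=16):
    examples, bound = [], 1
    while bound <= max_bound:
        cand = synthesize(examples, bound)
        if cand is None:          # stratum exhausted: move to a more expressive one
            bound *= 2
            continue
        cex = verify(*cand)
        if cex is None:
            return cand, bound
        examples.append(cex)      # refine with the new counterexample
    return None


print(stratified_cegis())
```

Small strata are tried first, so a candidate is only ever fitted to the counterexamples with the least expressive template that can still explain them; the search moves to larger coefficients only when the examples rule out every smaller instance.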

We have implemented the above framework and have applied our tool PCSAT 
to a diverse collection of 20 relational verification problems and obtained promis- 
ing results. The benchmark problems go beyond the capabilities of the existing 
related tools (such as CHCs solvers and program verification tools). PCSAT has 
solved 15 problems fully automatically by synthesizing complex witnesses for 
relational properties, and for the 5 problems that could not be solved fully 
automatically within the time limit, PCSAT was able to solve them semi- 
automatically provided that a part of an invariant is manually given as a hint. 


2 Overview 


2.1 Relational Verification Problems 


k-safety. Consider the following program taken from [50] that uses a summation 
to calculate the square of x, and then doubles it. 


doubleSquare(bool h, int x) {
  int z, y=0;
  if (h) { z = 2*x; } else { z = x; }
  while (z>0) { z--; y = y+x; }
  if (!h) { y = 2*y; }
  return y;
}


This program also takes another input h and, if the value of h is true, calculates 
the result differently. The classical relational property termination-insensitive 
non-interference (TI-NI) says that, roughly, an observer cannot infer the value 
of high security variables (h in this case) by observing the outputs (y). This is a 
2-safety property [17,54]: it relates two executions of the same program. In this
example, we ask whether two executions that initially agree on x (i.e., x1 = x2)
will agree on the resulting y (i.e., y1 = y2). The subscripts in these relations
indicate copies of the program: x1 is variable x in the first copy of the program
and x2 is variable x in the second copy. More generally, k-safety means that if
the initial states of a k-tuple of programs satisfy a pre-relation Pre, then when 
they all terminate the k-tuple of post states will satisfy post-relation Post. 
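
As a quick sanity check of the property (a bounded test, not a proof), doubleSquare can be transcribed and both values of h compared on sample inputs. The Python names below are hypothetical and chosen only for this sketch.

```python
def double_square(h: bool, x: int) -> int:
    """Python transcription of doubleSquare from the text."""
    z, y = (2 * x if h else x), 0
    while z > 0:
        z -= 1
        y += x
    if not h:
        y = 2 * y
    return y


def ti_ni_holds_on(xs) -> bool:
    """Bounded test of TI-NI: for every sampled x, the output does not depend on h."""
    return all(double_square(True, x) == double_square(False, x) for x in xs)


print(ti_ni_holds_on(range(-5, 50)))
```

Such testing only covers finitely many inputs; the verification problem discussed here asks for a proof over all pairs of executions.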

The literature proposes many ways to reason about k-safety including meth- 
ods of reducing a multi-program problem to a single-program problem, such 
as through self-composition [8,54,57], product programs [7], and their vari- 
ants [21,46,50,52,59]. Their key challenge is that of scheduling: how to interleave 
the programs’ executions so that invariants in the combined program are able to 
effectively describe cross-program relationships. Indeed, as proved by [50], ver- 
ifying this example with the naive lock-step scheduling is impossible with only 
linear arithmetic invariants while linear arithmetic invariants suffice with a more 
“semantic” scheduling that schedules the copy with h1 = true to iterate the
loop twice per iteration of the loop in the copy with h2 = false.
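
The effect of such a semantic schedule can be simulated directly. The sketch below interleaves the loops of two copies of doubleSquare, one that took the branch h = true (so z starts at 2*x) and one that took the branch h = false (so z starts at x), letting the first copy take two loop iterations for each iteration of the second. The two asserted relations are linear candidate invariants written by hand for this illustration; they are assumptions of the sketch rather than output of any tool.

```python
def run_semantic_schedule(x: int):
    """Interleave the loops of two doubleSquare copies (h1 = true, h2 = false)."""
    z1, y1 = 2 * x, 0   # copy 1 after `if (h)` with h = true
    z2, y2 = x, 0       # copy 2 after `if (h)` with h = false
    while z1 > 0 or z2 > 0:
        if z1 == 2 * z2:
            # only copy 1 moves
            assert y1 == 2 * y2                 # linear relational invariant
            z1, y1 = z1 - 1, y1 + x
        else:
            # both copies move
            assert z1 + 1 == 2 * z2 and y1 == 2 * y2 + x
            z1, y1 = z1 - 1, y1 + x
            z2, y2 = z2 - 1, y2 + x
    return y1, 2 * y2   # copy 2 doubles y after its loop because h = false


for x in (0, 1, 3, 7):
    print(x, run_semantic_schedule(x))
```

Under the lock-step schedule the relation between y1 and y2 at matching points involves the product of x and the iteration count, which is why linear invariants do not suffice there.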

In this paper, we will describe a way to pose the scheduling problem as a part 
of a series of constraints so that the search for an effective scheduler is relegated 
to the solver level. In our approach, a k-safety verification problem is encoded as 
a set of constraints containing (ordinary) predicate variables that represent the 
scheduler to be discovered and a relational invariant preserved by the scheduler. 
Specifically, we introduce a predicate variable inv that represents a relational
invariant and, for each $A \subseteq \{1,\dots,k\}$, a predicate variable $sch_A(V_1,\dots,V_k)$ where $V_i$
are the variables of the ith program, and add constraints that say that if the
predicate is true, then the programs whose indices are in A will step forward
while the rest remain still and also inv is preserved by the step. For soundness, 
it is important to constrain the scheduler to be fair, i.e., at least one program 
that can progress must be scheduled to progress if there is a program that can 
progress. As we shall show in Sect. 4, non-Horn clauses are essential to expressing 
the fairness constraint. Roughly, the idea is to use a clause with multiple positive
predicate variables (i.e., head disjunction) to say “if the relational invariant
holds, then at least one of the unfinished programs must be scheduled to progress.” 

Our approach is similar to and is inspired by the approach of [50] that also 
infers a fair semantic scheduler. However, their approach requires the user to 
provide sufficient atomic predicates manually and is not fully automated. By 
contrast, our approach soundly-and-completely encodes the k-safety verification 
problem together with scheduler inference as a set of constraints thanks to the 
expressiveness of pfwCSP, and automatically solves those constraints by the 
stratified CEGIS algorithm (cf. Sect. 7 for further comparison). 


Co-termination. Now consider the following pair of programs. 


$P_1^{ct}$: while (x>0) { x = x-y; }
$P_2^{ct}$: while (x>0) { x = x-2*y; }


A (non-safety) relational question is whether these programs $P_1^{ct}$ and $P_2^{ct}$ agree
on termination [6,10]. In general they do not: if, for example, $P_1^{ct}$ is executed
with $x < 0$ and $P_2^{ct}$ with $x > 0 \wedge y = 0$, the first will terminate while the second
will diverge. However, under the pre-relation $Pre \equiv x_1 = x_2 \wedge y_1 = y_2$, they will
agree on termination: the first program terminates iff the second one does. The
property falls outside of the k-safety fragment as it cannot be refuted by finite
execution traces. It is worth noting that termination-sensitive non-interference
(TS-NI) is the conjunction of TI-NI and co-termination of two copies of the same
target program with Pre equating the copies’ low inputs.
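
The two claims above can be explored with a bounded experiment. In the sketch below, divergence is approximated by running out of a step budget, so the result is only evidence and not a proof; the function names and the fuel parameter are hypothetical.

```python
def terminates_within(x: int, y: int, factor: int, fuel: int) -> bool:
    """Run `while (x > 0) { x = x - factor*y; }` for at most `fuel` iterations."""
    for _ in range(fuel):
        if x <= 0:
            return True
        x -= factor * y
    return x <= 0


def agree_on_termination(x1, y1, x2, y2, fuel=1000) -> bool:
    """Compare the first program (factor 1) with the second program (factor 2)."""
    return terminates_within(x1, y1, 1, fuel) == terminates_within(x2, y2, 2, fuel)


# Under the pre-relation x1 = x2 and y1 = y2 the two programs agree on all samples.
print(all(agree_on_termination(x, y, x, y)
          for x in range(-5, 20) for y in range(-3, 4)))
# Without the pre-relation they may disagree (first terminates, second diverges).
print(agree_on_termination(-1, 0, 1, 0))
```

The fuel bound makes the non-safety nature of the property visible: no finite budget can by itself establish that a run diverges, which is why the encoding needs well-foundedness constraints.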

Proving co-termination, like k-safety, can be aided by a scheduler, and we can
again use our constraints over predicate variables. But this is not enough. We
need additional constraints to ensure that whenever one of the two has
terminated, the other is also guaranteed to terminate. To address this, we next
introduce well-founded predicate variables. These predicate variables will appear
in our generalized language of constraints as terms of the form $wfr(V_i, V_i')$, where
the relation wfr must be discovered by the constraint solving method. (In Sect. 5
we describe how to achieve this through our stratified CEGIS algorithm.) For
the above example, our stratified CEGIS algorithm and our tool PCSAT
automatically discover (1) a schedule where the two programs step together when
$x_1 > 0$ and $x_2 > 0$, (2) a relational invariant that implies that if the first
program is terminated, then either the second program is terminated or $y_2 \ge 1$
(and vice-versa), and (3) well-founded relations that (combined with the
relational invariant) witness that if the loop has terminated in the second program
($x_2 \le 0$) but not in the first ($x_1 > 0$), then a transition in the first is
well-founded (and vice-versa). In Sect. 4, we show how co-termination problems can
be soundly-and-completely encoded in pfwCSP. 


Generalized Non-interference. Now consider the following program. 


gniEx(bool high, int low) {
  if (high) {
    int x = *int; if (x >= low) { return x; } else { while (true) {} }
  } else {
    int x = low; while (*bool) { x++; } return x;
  }
}


The *int (resp. *bool) above indicates an integer (resp. a binary) non-deterministic
choice. Termination-insensitive generalized non-interference (TI-GNI) [40] is an
extension of non-interference to non-deterministic programs, and
it says that for any two copies of the program with possibly different values for 
the high security input (high in this example) and with the same value for the 
low security input (low in this example), if one copy has a terminating execution 
that ends in some output (the final value of x in this example), then the other 
copy has either a terminating execution ending in the same output or a non- 
terminating execution. The termination-sensitive variant (TS-GNI) strengthens 
the condition by asserting that if one copy has a terminating execution then the
other copy has a terminating execution that ends in the same output. Both GNI
variants are $\forall\exists$ hyperproperties and fall outside of the k-safety fragment.

Verifying GNI requires handling non-determinism. Note that non-determinism
occurs both demonically (i.e., as $\forall$) and angelically (i.e., as $\exists$)
in GNI. While handling demonic non-determinism is straightforward in a 
constraint-based verification since the term variables are implicitly universally 
quantified, handling angelic non-determinism is less straightforward. 


[Figure 1 diagram: relational verification problems (including co-termination and generalized NI) are expressed as pfwCSP constraint problems (Section 3), which are solved via stratified CEGIS (Section 5); implementation and evaluation are in Section 6.]


Fig. 1. Overview of the contributions and how they achieve a constraint-based strategy 
for relational verification. 


Our approach handles finitary angelic non-determinism like *bool by adding
non-Horn clauses with head disjunctions that roughly express the condition “the
relational invariant remains true in one of the finitely many next step choices”.
To handle infinitary non-determinism like *int, we introduce functional predicate
variables denoted f(V, r). In these terms, f is a predicate variable to be
discovered but with a new wrinkle: this predicate involves a return value r and the
interpretation of f is a total function over V. For this example, we introduce
the term f(V, r) where r represents the value chosen non-deterministically at
*int and V are program variables and prophecy variables that represent the final
return values of the demonic copy. For this example, PCSAT automatically
discovers the predicate $r = ret_1$ where $ret_1$ is the prophecy variable for the return
value of the demonic copy. With it, PCSAT is able to verify TS-GNI and TI-GNI
of this example. We remark that functional predicates are also used to encode
scheduler fairness in the presence of non-termination and are needed to ensure
soundness for properties like co-termination and TS-GNI. In Sect. 4.3, we show
how TI-GNI and TS-GNI can be soundly-and-completely encoded in pfwCSP.
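
The role of the prophecy value can be made concrete by running gniEx with its non-deterministic choices supplied by explicit oracles. The sketch below is an illustration under assumptions: the oracle interface, the fuel bound used to model divergence, and in particular the strategy for the *bool choice (keep looping while x is below the prophecy value) are hypothetical. Only the choice x = ret1 for *int corresponds to the predicate mentioned in the text.

```python
from typing import Callable, Optional


def gni_ex(high: bool, low: int,
           choose_int: Callable[[], int],
           choose_bool: Callable[[int], bool],
           fuel: int = 1000) -> Optional[int]:
    """gniEx with explicit choice oracles; None models (fuel-bounded) divergence."""
    if high:
        x = choose_int()
        return x if x >= low else None     # `while (true) {}` is modelled as None
    x = low
    for _ in range(fuel):
        if not choose_bool(x):
            return x
        x += 1
    return None


def angelic_match(high: bool, low: int, ret1: int) -> Optional[int]:
    """Resolve the angelic copy's choices from the demonic copy's output ret1."""
    return gni_ex(high, low,
                  choose_int=lambda: ret1,          # corresponds to r = ret1
                  choose_bool=lambda x: x < ret1)   # hypothetical *bool strategy


# Every output ret1 >= low of the demonic copy can be matched by the angelic copy.
print(all(angelic_match(high, low, ret1) == ret1
          for high in (True, False)
          for low in range(-3, 4)
          for ret1 in range(low, low + 10)))
```

The angelic choices depend on the final output of the other copy, which is only known in the future; passing it in as a parameter is exactly what the prophecy variable provides.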


2.2 Challenges and Contributions 


There are several challenges that we face in supporting relational verification 
problems with a constraint-based approach. The subsequent sections of this 
paper are organized around addressing those challenges as follows: 


— We first ask how to generalize the constraint language to go beyond CHCs
to express a more general class of relational verification problems. To this
end, in Sect. 3, we present a new language called predicate constraint
satisfaction problems (pfwCSP), which incorporates non-Horn clauses, (ordinary)
predicate variables, well-founded predicate variables, and functional predicate
variables.

— We next return to the above relational verification problems —k-safety, co- 
termination, and generalized non-interference— and describe how pfwCSP can 
express each of them in a sound and complete manner in Sect. 4. 

— The next major contribution of our research is a novel stratified CEGIS algo- 
rithm for solving pfwCSP constraints. Our approach integrates advanced ver- 
ification techniques: stratified family of templates [55] and CEGIS of invari- 
ants/ranking functions [14,26,28,45]. While the individual ideas have been 
proposed previously, they have only been designed for less expressive frame- 
works such as CHCs, and substantial extensions are needed to combine and 
apply them to the new pfwCSP framework as we shall show in Sect. 5. 

— We next turn to an implementation and experimental validation on a diverse 
collection of 20 relational verification problems, consisting of k-safety prob- 
lems from Shemer et al. [50] and new co-termination and GNI problems in 
Sect.6. As far as we know, none of the existing automated tools other than 
our new tool called PCSAT can solve them. 


In sum, Fig.1 depicts each of these sections and how, together, they enable 
relational verification. For space, extra materials are deferred to the extended 
report [58]. 


3 Predicate Constraint Satisfaction Problems pfwCSP 


As discussed in Sect. 2, CHCs are insufficient to express important relational
verification problems. In this section we introduce a generalized language of
constraints called pfwCSP. The language of constraint satisfaction problems
(CSP) permits non-Horn clauses and predicate variable terms, including those for
functional predicates and well-founded relations (pfw). We now define pfwCSP.

Let $\mathcal{T}$ be a (possibly many-sorted) first-order theory with the signature $\Sigma$.
The syntax of $\mathcal{T}$-formulas and $\mathcal{T}$-terms is:


(formulas) $\phi ::= X(\tilde{t}) \mid p(\tilde{t}) \mid \neg\phi \mid \phi_1 \vee \phi_2 \mid \phi_1 \wedge \phi_2$


(terms) $t ::= x \mid f(\tilde{t})$


Here, the meta-variables x and X respectively range over term and predicate
variables. The meta-variables p and f respectively denote predicate and function
symbols of $\Sigma$. We use s as a meta-variable ranging over sorts of the signature
$\Sigma$. We write $\bullet$ for the sort of propositions and $s_1 \to s_2$ for the sort of functions
from $s_1$ to $s_2$. We write ar(o) and sort(o) respectively for the arity and the
sort of a syntactic element o. A function f represents a constant if ar(f) = 0.
We write $ftv(\phi)$ and $fpv(\phi)$ respectively for the set of free term and predicate
variables that occur in $\phi$. We write $\tilde{x}$ for a sequence of term variables, $|\tilde{x}|$ for
the length of $\tilde{x}$, and $\epsilon$ for the empty sequence. We often abbreviate $\neg\phi_1 \vee \phi_2$ as
$\phi_1 \Rightarrow \phi_2$. We henceforth consider only well-sorted formulas and terms. We use
$\varphi$ as a meta-variable ranging over $\mathcal{T}$-formulas without predicate variables.

We now define a pCSP $\mathcal{C}$ (with ordinary but without well-founded and
functional predicate variables) to be a finite set of clauses of the form


$\varphi \vee \left(\bigvee_{i=1}^{\ell} X_i(\tilde{t}_i)\right) \vee \left(\bigvee_{i=\ell+1}^{m} \neg X_i(\tilde{t}_i)\right)$ (1)


where $0 \le \ell \le m$. We write $ftv(c)$ for the set of free term variables of a clause
c. The set of free term variables of $\mathcal{C}$ is defined by $ftv(\mathcal{C}) = \bigcup_{c \in \mathcal{C}} ftv(c)$. We
regard the variables in $ftv(c)$ as implicitly universally quantified. We write $fpv(\mathcal{C})$
for the set of free predicate variables that occur in $\mathcal{C}$. A predicate substitution
$\sigma$ is a finite map from predicate variables X to closed predicates of the form
$\lambda x_1, \dots, x_{ar(X)}.\varphi$. We write $\sigma(\mathcal{C})$ for the application of $\sigma$ to $\mathcal{C}$ and $dom(\sigma)$ for
the domain of $\sigma$. We call $\sigma$ a syntactic solution for $\mathcal{C}$ if $fpv(\mathcal{C}) \subseteq dom(\sigma)$ and
$\models \bigwedge \sigma(\mathcal{C})$. Similarly, we call a predicate interpretation $\rho$ a semantic solution for
$\mathcal{C}$ if $fpv(\mathcal{C}) \subseteq dom(\rho)$ and $\rho \models \bigwedge \mathcal{C}$.


Remark 1. The language pCSP generalizes over existing languages of
constraints. CHCs can be obtained as a restriction of pCSP where $\ell \le 1$ in (1)
for all clauses. We can also define coCHCs as pCSP but with the restriction that
$m \le \ell+1$ for all clauses. A linear CHCs is a pCSP that is both CHCs and
coCHCs.
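
The restrictions in Remark 1 amount to counting the positive and negative predicate-variable atoms of each clause. The following sketch makes this explicit; the Clause record and the string representation of atoms are hypothetical simplifications introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Clause:
    """A pCSP clause of form (1): phi or X_1(..) or ... or not X_m(..)."""
    phi: str                     # predicate-variable-free part, kept opaque here
    positive: Tuple[str, ...]    # X_1 .. X_l       (atoms occurring positively)
    negative: Tuple[str, ...]    # X_{l+1} .. X_m   (atoms occurring negatively)


def is_chc(clauses) -> bool:
    """At most one positive atom per clause, i.e. l <= 1."""
    return all(len(c.positive) <= 1 for c in clauses)


def is_cochc(clauses) -> bool:
    """At most one negative atom per clause, i.e. m <= l + 1."""
    return all(len(c.negative) <= 1 for c in clauses)


def is_linear_chc(clauses) -> bool:
    return is_chc(clauses) and is_cochc(clauses)


init = Clause("not Pre(V)", ("inv",), ())
fair = Clause("F1(V1) and F2(V2)", ("sch_1", "sch_2", "sch_12"), ("inv",))
print(is_chc([init]), is_chc([init, fair]), is_cochc([init, fair]))
```

The clause named fair has three positive atoms (a head disjunction), so a constraint set containing it is a pCSP but not a set of CHCs.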


We next extend pCSP to pfwCSP by adding well-foundedness and
functionness constraints. A pfwCSP $(\mathcal{C}, \mathcal{K})$ consists of


— a finite set $\mathcal{C}$ of pCSP-clauses over predicate variables and

— a kinding function $\mathcal{K}$ that maps each predicate variable $X \in fpv(\mathcal{C})$ to its kind:
any one of $\bullet$, $\Downarrow$, or $\lambda$ which respectively represent ordinary, well-founded, and
functional predicate variables.


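We write $\rho \models WF(X)$ if the interpretation $\rho(X)$ of the predicate variable X is
well-founded, that is, $sort(X) = (\tilde{s}, \tilde{s}) \to \bullet$ for some $\tilde{s}$ and there is no infinite
sequence $\tilde{v}_1, \tilde{v}_2, \dots$ of sequences $\tilde{v}_i$ of values of the sorts $\tilde{s}$ such that
$(\tilde{v}_i, \tilde{v}_{i+1}) \in \rho(X)$ for all $i \ge 1$. We write $\rho \models FN(X)$ if X is functional, that is,
$sort(X) = (\tilde{s}, s) \to \bullet$ for some $\tilde{s}$ and s, and
$\rho \models \forall \tilde{x} : \tilde{s}.\ (\exists y : s.\ X(\tilde{x}, y)) \wedge \forall y_1, y_2 : s.\ (X(\tilde{x}, y_1) \wedge X(\tilde{x}, y_2) \Rightarrow y_1 = y_2)$
holds. We call a predicate interpretation $\rho$ a semantic
solution for $(\mathcal{C}, \mathcal{K})$ if $\rho$ is a semantic solution of $\mathcal{C}$, $\rho \models WF(X)$ for all X such
that $\mathcal{K}(X) = \Downarrow$, and $\rho \models FN(X)$ for all X such that $\mathcal{K}(X) = \lambda$. The notion of
syntactic solution can be similarly generalized to pfwCSP.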
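
For finite interpretations, both side conditions can be checked directly, which may help to build intuition. The sketch below is restricted to finite relations given as sets of pairs (an assumption of the sketch; the interpretations in the text range over arbitrary values of the background theory). For a finite relation, the absence of an infinite chain is equivalent to the absence of a cycle.

```python
def is_well_founded(rel) -> bool:
    """Finite relation: no infinite chain v1, v2, ... with (v_i, v_{i+1}) in rel."""
    succ = {}
    for a, b in rel:
        succ.setdefault(a, set()).add(b)
    state = {}  # node -> "open" (on the current DFS path) or "done"

    def has_cycle(n) -> bool:
        if state.get(n) == "open":
            return True
        if state.get(n) == "done":
            return False
        state[n] = "open"
        found = any(has_cycle(m) for m in succ.get(n, ()))
        state[n] = "done"
        return found

    return not any(has_cycle(n) for n in list(succ))


def is_functional(rel, domain) -> bool:
    """Total and single-valued: each x in domain has exactly one y with (x, y) in rel."""
    return all(len({y for (x2, y) in rel if x2 == x}) == 1 for x in domain)


print(is_well_founded({(3, 2), (2, 1), (1, 0)}), is_well_founded({(0, 1), (1, 0)}))
print(is_functional({(0, 1), (1, 1)}, [0, 1]), is_functional({(0, 1)}, [0, 1]))
```

Note that functionality requires totality as well as uniqueness, matching the two conjuncts of the FN condition.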




Definition 1 (Satisfiability of pfwCSP). The predicate satisfiability problem 
of a pfwCSP (C, K) is that of deciding whether it has a semantic solution. 


Remark 2. Recall that we assume that the $\mathcal{T}$-formulas $\varphi$ in pCSP clauses do not
contain quantifiers. The assumption, however, is not a restriction for pfwCSP
because we can Skolemize quantifiers using functional predicates.


4 Relational Verification with Constraints


We now present reductions from relational verification problems to pfwCSP, thus 
enabling a new route to automation of these problems. We begin with k-safety, 
and then move toward liveness and non-determinism, which are thorny problems 
in the relational setting. We first provide some basic definitions and notations. 


Programs. We consider programs $P_1, \dots, P_k$ on variables $V_1, \dots, V_k$ respectively.
A state of the program $P_i$ is a valuation of the variables $V_i$. We represent such
a valuation by a sequence of values $\tilde{v}$ such that $|\tilde{v}| = |V_i|$. We assume that each
$P_i$ is defined by the predicate $T_i(V_i, V_i')$ denoting its one-step transition relation,
i.e., $T_i(\tilde{v}, \tilde{v}')$ implies that evaluating $P_i$ one step from the state $\tilde{v}$ reaches the
state $\tilde{v}'$. We also assume that there is a predicate $F_i(V_i)$ that represents the
final states of the program such that $F_i(\tilde{v})$ and $T_i(\tilde{v}, \tilde{v}')$ implies $\tilde{v} = \tilde{v}'$, i.e., the
program self-loops when it reaches a final state. We say that a state $\tilde{v}$
(multi-step) reaches a final state $\tilde{v}'$ in the evaluation of $P_i$, written $\tilde{v} \leadsto_i \tilde{v}'$, if there
exists a non-empty finite sequence of states $\pi$ such that $\pi[1] = \tilde{v}$, $\pi[|\pi|] = \tilde{v}'$,
$T_i(\pi[j-1], \pi[j])$ for all $1 < j \le |\pi|$, and $F_i(\tilde{v}')$. We write $\tilde{v} \leadsto_i \bot$ if there exists a
non-terminating evaluation from $\tilde{v}$ in $P_i$, i.e., if there exists an infinite sequence
of states $\varpi$ such that $\varpi[1] = \tilde{v}$, $T_i(\varpi[j-1], \varpi[j])$ for all $1 < j$, and $\neg F_i(\varpi[j])$ for
all $0 < j$. We note that a program may be non-deterministic, that is, $T_i(\tilde{v}, \tilde{v}')$
and $T_i(\tilde{v}, \tilde{v}'')$ may both be true for some $\tilde{v}' \neq \tilde{v}''$.


4.1 k-Safety 


A k-safety property is given by predicates Pre(V) and Post(V) that respectively 
denote the pre and the post relations across the k-tuple. 


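Definition 2 (k-safety). The k-safety property verification problem is to decide
if the following holds:


$\forall \tilde{v} = \tilde{v}_1, \dots, \tilde{v}_k.\ \forall \tilde{v}' = \tilde{v}_1', \dots, \tilde{v}_k'.\ Pre(\tilde{v}) \wedge \bigwedge_{i=1}^{k} \tilde{v}_i \leadsto_i \tilde{v}_i' \Rightarrow Post(\tilde{v}')$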


That is, any k-tuple of final states reachable from a k-tuple of states satisfying 
the precondition satisfies the post condition. For instance, the TI-NI verification 
from Sect. 2.1 is a 2-safety property where $P_1$ and $P_2$ are copies of the same
program, Pre states that the low inputs of the two programs are equal (i.e.,
$x_1 = x_2$ in the example), and Post states that the low outputs of the two
programs are equal (i.e., $y_1 = y_2$ in the example).

We now describe a new way to pose the k-safety relational verification prob- 
lem via constraints written in pfwCSP. We write [k] for the set {1,...,k}. We 
define $\mathcal{P}^+[k] = \{S \subseteq [k] \mid S \neq \emptyset\}$. Let $\widetilde{V} = V_1, \dots, V_k$ be a k-tuple of vectors,
corresponding to the variables of the k programs. 


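Definition 3 (k-safety through constraints). We define pfwCSP
constraints $\mathcal{C}_S$ to be the set of the following clauses:

(1) $Pre(\widetilde{V}) \Rightarrow inv(\widetilde{V})$
(2) $inv(\widetilde{V}) \wedge \bigwedge_{i \in [k]} F_i(V_i) \Rightarrow Post(\widetilde{V})$
(3) For each $A \in \mathcal{P}^+[k]$,
    $inv(\widetilde{V}) \wedge sch_A(\widetilde{V}) \wedge \bigwedge_{i \in A} T_i(V_i, V_i') \wedge \bigwedge_{i \in [k] \setminus A} V_i = V_i' \Rightarrow inv(\widetilde{V}')$
(4) For each $A \in \mathcal{P}^+[k]$, $inv(\widetilde{V}) \wedge sch_A(\widetilde{V}) \wedge \bigvee_{i \in [k]} \neg F_i(V_i) \Rightarrow \bigvee_{i \in A} \neg F_i(V_i)$
(5) $inv(\widetilde{V}) \wedge \bigvee_{i \in [k]} \neg F_i(V_i) \Rightarrow \bigvee_{A \in \mathcal{P}^+[k]} sch_A(\widetilde{V})$.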


Here, inv and $sch_A$ (for each $A \in \mathcal{P}^+[k]$) are ordinary predicate variables.
Roughly, the predicate variables $sch_A$ describe a scheduler. The scheduler
stipulates that when $sch_A(\tilde{v}_1, \dots, \tilde{v}_k)$ is true, each $P_i$ such that $i \in A$ takes a step from
the state $\tilde{v}_i$ while the others remain still. Note that the scheduler is semantic in
the sense that which programs are scheduled to be executed next can depend on
the current states of the programs. Clauses (1)-(3) assert that inv is an
invariant sufficient to prove the given safety property with the scheduler defined by
the $sch_A$’s. Clauses (4) say that if an inv-satisfying state is such that the processes
in A are allowed to move and some program has not yet terminated, then at
least one process in A has not yet terminated. Clause (5) says that any state
satisfying inv in which some program is unfinished has to satisfy some $sch_A$.
Clauses (4) and (5) ensure the fairness
of the scheduler, that is, at least one unfinished program is scheduled to make
progress if there is an unfinished program.
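
To see how the five clauses interact, they can be checked by brute force for k = 2 on a small finite-state instance. The sketch below is an illustration only: the explicit enumeration of states, the toy program (a counter loop), and the candidate invariant and schedulers are all hypothetical; in the approach described here such candidates are found by constraint solving rather than supplied by hand.

```python
from itertools import product

SETS = [(1,), (2,), (1, 2)]  # the non-empty subsets A of {1, 2}


def check_clauses(states, T, F, pre, post, inv, sch):
    """Brute-force clauses (1)-(5) for k = 2 over a finite state set.

    T = (T1, T2) and F = (F1, F2); sch maps each A in SETS to a predicate.
    """
    for v in product(states, repeat=2):
        if pre(*v) and not inv(*v):                               # clause (1)
            return False
        if not inv(*v):
            continue
        done = [F[0](v[0]), F[1](v[1])]
        if all(done) and not post(*v):                            # clause (2)
            return False
        for A in SETS:
            if not sch[A](*v):
                continue
            # clause (3): programs in A step, the others stay put
            nxt = [[w for w in states if T[i](v[i], w)] if i + 1 in A else [v[i]]
                   for i in range(2)]
            if not all(inv(w1, w2) for w1 in nxt[0] for w2 in nxt[1]):
                return False
            # clause (4): if some program is unfinished, so is one in A
            if not all(done) and all(done[i - 1] for i in A):
                return False
        if not all(done) and not any(sch[A](*v) for A in SETS):   # clause (5)
            return False
    return True


N = 3
STATES = [(z, y) for z in range(N + 1) for y in range(N + 1)]


def step(v, w):   # `while (z > 0) { z--; y++; }`, self-looping once z == 0
    z, y = v
    return w == ((z - 1, y + 1) if z > 0 else v)


def final(v):
    return v[0] == 0


pre = lambda v1, v2: v1 == v2 and v1[1] == 0
post = lambda v1, v2: v1[1] == v2[1]
inv = lambda v1, v2: v1 == v2
lockstep = {(1,): lambda a, b: False, (2,): lambda a, b: False, (1, 2): lambda a, b: True}
idle = {A: (lambda a, b: False) for A in SETS}

print(check_clauses(STATES, (step, step), (final, final), pre, post, inv, lockstep))
print(check_clauses(STATES, (step, step), (final, final), pre, post, inv, idle))
```

The scheduler named idle never schedules anything, so clause (3) holds vacuously for it; it is rejected only by the fairness clause (5), which is the clause with a head disjunction.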


Theorem 1 (Soundness and Completeness of $\mathcal{C}_S$). The given k-tuple of
programs satisfies the given k-safety property iff $\mathcal{C}_S$ is satisfiable.


We note that the soundness direction crucially relies on scheduler fairness. The 
completeness is with respect to semantic solutions (cf. Definition 1) and it is 
only “relative” with respect to syntactic solutions: a syntactic solution only 
exists when the predicates of the background theory are able to express sufficient 
invariants and schedulers (impossible in general for any decidable theory when 
the class of programs is Turing-powerful as in our case when the background 
theory of predicates is QFLIA). 

It is important to note that $\mathcal{C}_S$ is not CHCs because clause (5) has a head
disjunction. $\mathcal{C}_S$ may be seen as a constraint-based formulation of the approach
proposed in [50]. However, their approach requires the user to provide sufficient 
predicates manually and is not fully automated, while our approach can fully 
automatically solve the problems by constraint solving (cf. Sect. 5). 


Example 1. The formalization allows flexible scheduling. For instance, for the
TI-NI example from Sect. 2.1, our approach is able to infer the predicate substitution
that maps $sch_{\{1\}}$, $sch_{\{2\}}$, and $sch_{\{1,2\}}$ to
$\lambda \widetilde{V}.\ h_1 \wedge \neg h_2 \wedge z_1 = 2 z_2$,
$\lambda \widetilde{V}.\ \neg h_1 \wedge h_2 \wedge z_2 = 2 z_1$, and
$\lambda \widetilde{V}.\ (h_1 \wedge \neg h_2 \Rightarrow z_1 + 1 = 2 z_2) \wedge (\neg h_1 \wedge h_2 \Rightarrow z_2 + 1 = 2 z_1)$
respectively, where $\widetilde{V}$ is the list of the variables in the two program copies. The
inferred predicates stipulate that the copy with h = true is scheduled to execute
the loop two times per every loop iteration of the copy with h = false. The
extended report [58] shows the pfwCSP encoding of the example. A solution
generated by PCSAT is also shown in [58].


4.2 Co-termination


Intuitively, co-termination means that if one program terminates, then a second 
program must terminate [6,10]. This can also be thought of as a form of relational 
termination problem.²


Definition 4 (Co-termination). The co-termination verification problem is
to decide if for all $\tilde{v}_1, \tilde{v}_2$ such that $Pre(\tilde{v}_1, \tilde{v}_2)$, if $\tilde{v}_1 \leadsto_1 \tilde{v}_1'$ then $\tilde{v}_2 \not\leadsto_2 \bot$.


Roughly, the property says that from any pair of states related by Pre, if $P_1$
terminates, then $P_2$ must also terminate. Note that this is an asymmetric
property. A symmetric version can be obtained by also asserting the property with
the positions of the two programs exchanged. The symmetric version implies,
assuming that there is at least one execution from any Pre-related state, that
from any pair of Pre-related states, all executions from one state terminate iff
all executions from the other one do as well. We now present an encoding of
conditional co-termination in pfwCSP.


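Definition 5 (Co-termination through constraints). Let $\widetilde{V} = V_1, V_2$. We
define pfwCSP constraints $\mathcal{C}_{CoT}$ to be the set of the following clauses:

(1) $Pre(\widetilde{V}) \wedge fnb(\widetilde{V}, b) \Rightarrow inv(0, b, \widetilde{V})$
(2) $inv(d, b, \widetilde{V}) \wedge \neg F_1(V_1) \wedge \neg F_2(V_2) \Rightarrow (-b \le d \wedge d \le b \wedge b \ge 0)$
(3a) $inv(d, b, \widetilde{V}) \wedge sch_{FT}(d, b, \widetilde{V}) \wedge T_2(V_2, V_2') \wedge (F_1(V_1) \vee F_2(V_2) \vee d' = d-1) \Rightarrow inv(d', b, V_1, V_2')$
(3b) $inv(d, b, \widetilde{V}) \wedge sch_{TF}(d, b, \widetilde{V}) \wedge T_1(V_1, V_1') \wedge (F_1(V_1) \vee F_2(V_2) \vee d' = d+1) \Rightarrow inv(d', b, V_1', V_2)$
(3c) $inv(d, b, \widetilde{V}) \wedge sch_{TT}(d, b, \widetilde{V}) \wedge T_1(V_1, V_1') \wedge T_2(V_2, V_2') \Rightarrow inv(d, b, V_1', V_2')$
(4a) $inv(d, b, \widetilde{V}) \wedge sch_{FT}(d, b, \widetilde{V}) \wedge \neg F_1(V_1) \Rightarrow \neg F_2(V_2)$
(4b) $inv(d, b, \widetilde{V}) \wedge sch_{TF}(d, b, \widetilde{V}) \wedge \neg F_2(V_2) \Rightarrow \neg F_1(V_1)$
(5) $inv(d, b, \widetilde{V}) \wedge (\neg F_1(V_1) \vee \neg F_2(V_2)) \Rightarrow \bigvee_{A \in \{TT, FT, TF\}} sch_A(d, b, \widetilde{V})$
(6) $inv(d, b, \widetilde{V}) \wedge F_1(V_1) \wedge \neg F_2(V_2) \wedge T_2(V_2, V_2') \Rightarrow wfr(V_2, V_2')$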


Here, $sch_{TT}$, $sch_{FT}$, and $sch_{TF}$ are the specialization to k = 2 of the k-safety scheduler of
Definition 3. Clauses (3x)’s are similar to (3) of Definition 3 and assert that inv 
is an invariant under the scheduler. Clauses (4x)’s and (5), like (4) and (5) of 
Definition 3, are used to ensure the scheduler fairness. However, they are insuffi- 
cient for co-termination as a non-terminating copy can be scheduled indefinitely 
leaving the other copy unscheduled. Clauses (1) and (2) are added to amend the 
issue. In (1), fnb is a functional predicate variable that is used to select a bound 
b, and (2) asserts that the difference d between the numbers of steps taken by the 
two copies is within b in any state in inv when neither copy has terminated. Note 
that d is initialized to 0 by (1) and properly updated in (3x)’s. Finally, by using 
the well-founded predicate variable wfr, (6) asserts that if $P_1$ has terminated,
then $P_2$ must eventually terminate as well.


2 The property has also been called relative termination [31].


Theorem 2 (Soundness and Completeness of $\mathcal{C}_{CoT}$). The given pair of
programs co-terminate iff $\mathcal{C}_{CoT}$ is satisfiable.


As with Theorem 1, the soundness direction relies on scheduler fairness. 


Example 2. Via the encoding, our PCSAT tool is able to verify the symmetric 
co-termination example from Sect.2.1 by automatically inferring the solution 
described there. For space, the concrete constraint set and solution are given in 
the extended report [58]. 


4.3 Generalized Non-interference 


We now turn to another relational property that cannot simply be captured by 
k-safety or co-termination. So-called termination-insensitive (resp. -sensitive) 
generalized non-interference (resp. TI-GNI, TS-GNI) are $\forall\exists$ hyperproperties:
from any pre-related pair of states whenever one side can take a move to a post
state, there must be a way for the other side to also move to a post state such that
the post-relation holds. As remarked in Sect. 2, verifying GNI requires reasoning
about both demonic (i.e., for all) and angelic (i.e., exists) non-determinism.


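Definition 6 (TI/TS-GNI). The GNI verification problem is to decide if the
following holds. If $Pre(\tilde{v}_1, \tilde{v}_2)$ and $\tilde{v}_1 \leadsto_1 \tilde{v}_1'$ then
(TI-GNI) $(\exists \tilde{v}_2'.\ \tilde{v}_2 \leadsto_2 \tilde{v}_2' \wedge Post(\tilde{v}_1', \tilde{v}_2')) \vee \tilde{v}_2 \leadsto_2 \bot$; or
(TS-GNI) $\exists \tilde{v}_2'.\ \tilde{v}_2 \leadsto_2 \tilde{v}_2' \wedge Post(\tilde{v}_1', \tilde{v}_2')$.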


Note that our definition is parameterized by Pre and Post. The standard GNI 
definitions [40] can be obtained by letting $P_1$ and $P_2$ be copies of the same target
program and letting Pre be the predicate equating the low inputs of the copies 
and Post be the predicate equating the low outputs of the copies. 

To formalize the pfwCSP encodings of the GNI verification problems,
we define a relation $U_2$ to be one such that $T_2(\tilde{v}, \tilde{v}') \Leftrightarrow \exists r.\ U_2(r, \tilde{v}, \tilde{v}')$ and
$U_2(r, \tilde{v}, \tilde{v}') \wedge U_2(r, \tilde{v}, \tilde{v}'') \Rightarrow \tilde{v}' = \tilde{v}''$. Roughly, $U_2$ is a function version of the
transition relation $T_2$ with the extra parameter r to make the non-deterministic
choices explicit.
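
The two conditions on the functional version of the transition relation can be illustrated on a tiny example. In the sketch below, the transition relation (stay or increment), the choice domain {0, 1}, and the finite ranges used for checking are hypothetical and serve only to show what the conditions require.

```python
def T2(v: int, w: int) -> bool:
    """Non-deterministic transition: stay or increment."""
    return w in (v, v + 1)


def U2(r: int, v: int, w: int) -> bool:
    """Functional version: the choice r in {0, 1} determines the successor."""
    return r in (0, 1) and w == v + r


VALUES = range(-3, 4)
SUCCS = range(-4, 6)
CHOICES = (0, 1)


def same_transitions() -> bool:
    """T2(v, w) holds iff U2(r, v, w) holds for some choice r."""
    return all(T2(v, w) == any(U2(r, v, w) for r in CHOICES)
               for v in VALUES for w in SUCCS)


def single_valued() -> bool:
    """For fixed r and v, U2 admits at most one successor."""
    return all(w1 == w2
               for r in CHOICES for v in VALUES
               for w1 in SUCCS for w2 in SUCCS
               if U2(r, v, w1) and U2(r, v, w2))


print(same_transitions(), single_valued())
```

Once the choice is an explicit parameter, resolving angelic non-determinism reduces to finding a function that supplies r, which is what a functional predicate variable represents.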

We now show the pfwCSP encodings of TI-GNI and TS-GNI. The key idea 
is to augment the encodings for k-safety and/or co-termination with functional 
predicate variables and prophecy variables that respectively represent the non- 
deterministic choices of the angelic side (i.e., P2) and the final outputs of the 
demonic side (i.e., P1). 


Definition 7 (TI-GNI through constraints). We define pfwCSP
constraints $\mathcal{C}_{TIGNI}$ as $\mathcal{C}_S$ in Definition 3 for k = 2 but with the following
modifications:


(m1) The parameters representing the inputs and outputs of $P_1$ are extended with
prophecy variables $\tilde{p}$ where $|\tilde{p}| = |V_1|$. Accordingly, each occurrence of $V_1$ is
replaced by $\tilde{p}, V_1$, and each occurrence of $V_1'$ is replaced by $\tilde{p}', V_1'$.

(m2) Pre is replaced by Pre' which is defined by $Pre'(\tilde{p}, V_1, V_2) \Leftrightarrow Pre(V_1, V_2)$,
i.e., the prophecy values are unconstrained in the precondition.

(m3) $F_1$ is replaced by $F_1'$ defined by $F_1'(\tilde{p}, V_1) \Leftrightarrow F_1(V_1)$.

(m4) $T_1$ is replaced by $T_1'$ defined by $T_1'(\tilde{p}, V_1, \tilde{p}', V_1') \Leftrightarrow T_1(V_1, V_1') \wedge \tilde{p} = \tilde{p}'$.

(m5) Post is replaced by Post' defined by $Post'(\tilde{p}, V_1, V_2) \Leftrightarrow (\tilde{p} = V_1 \Rightarrow Post(V_1, V_2))$,
i.e., if the prophecy was correct then the original post condition must hold.

(m6) Each occurrence of $T_2(V_2, V_2')$ is replaced by $fnr(\tilde{p}, V_2, r) \wedge U_2(r, V_2, V_2')$
where fnr is a functional predicate variable.


Modifications (m1)-(m5) concern prophecy variables. They are initialized
arbitrarily as shown in (m2), propagated unmodified through the transitions as
shown in (m4), and finally checked against $P_1$’s outputs in (m5).
Modification (m6) adds functional predicate variables to express the angelic
non-deterministic choices of $P_2$. The functional predicate variables shift the onus of
making the right choices to the solver’s task of discovering sufficient assignments 
to them. Importantly, the functional predicate takes the prophecy variables as 
parameters, thus allowing dependence on the final outputs of the demonic side. 


Definition 8 (TS-GNI through constraints). We define pfwCSP
constraints $\mathcal{C}_{TSGNI}$ as $\mathcal{C}_{CoT}$ in Definition 5 but with modifications of Definition 7
except (m3) and (m5), and with the following modifications:


(m3’) $F_1$ is replaced by $F_1'$ defined by $F_1'(\tilde{p}, V_1) \Leftrightarrow F_1(V_1) \wedge \tilde{p} = V_1$.
(m5’) The clause $inv(\tilde{p}, V_1, V_2) \wedge F_1'(\tilde{p}, V_1) \wedge F_2(V_2) \Rightarrow Post(V_1, V_2)$ is added.


$\mathcal{C}_{TSGNI}$ is similar to $\mathcal{C}_{TIGNI}$ except that it contains the difference bound and
well-foundedness constraints to handle the “co-termination” aspect of TS-GNI, i.e.,
if $P_1$ terminates and makes an output then $P_2$ must also be able to terminate and
make a matching output. One subtle aspect of the encoding is that (m3’) modifies
the final state predicate for $P_1$ to enforce co-termination only when the prophecy
is correct. However, it is worth noting that TS-GNI is not a conjunction of TI- 
GNI and co-termination. For instance, the GNI example from Sect. 2.1 satisfies 
TS-GNI but does not satisfy co-termination. 


Theorem 3 (Soundness and Completeness of TI-GNI). The given pair of
programs satisfy TI-GNI iff $\mathcal{C}_{\mathrm{TIGNI}}$ is satisfiable.


Theorem 4 (Soundness and Completeness of TS-GNI). The given pair
of programs satisfy TS-GNI iff $\mathcal{C}_{\mathrm{TSGNI}}$ is satisfiable.


The soundness directions are proven by “determinizing” the angelic choices by 
solutions to the functional predicate variables and reducing the argument to 
those of k-safety and co-termination. The completeness directions are proven by 
“synthesizing” sufficient angelic choice functions from program executions. 


Example 3. Via the encoding, our PCSAT tool is able to verify the TS-GNI 
example from Sect. 2.1 by automatically inferring not only the functional pred- 
icate described there but also relational invariants and well-founded relations 
given in the extended report [58]. For space, the concrete constraint set is also 
given in [58]. 




Remark 3. The angelic non-determinism encoding can be optimized by
using head disjunctions when the non-determinism is finitary (i.e.,
$\max_{v} |\{v' \mid T_2(v, v')\}|$ is finite) instead of using functional predicate variables.
For this, we modify clauses (3) and (3x)’s of Definitions 7 and 8 to contain
multiple positive occurrences of $inv$ where each occurrence represents one of the
finitely many possible choices.


Remark 4. Recall that we allow any program to be non-deterministic. The k- 
safety and co-termination encodings treat non-determinism in all programs as 
demonic, whereas the GNI encodings treat those in one program (i.e., $P_1$) as
demonic and those in the other program (i.e., P2) as angelic. In general, an 
arbitrary program can be made angelic by applying the transformation done in 
the angelic side of GNI encodings (to factor out non-determinism). 


5 Constraint Solving Method for pfwCSP 


We describe a CEGIS-based method for finding a (syntactic) solution of the given
pfwCSP $(\mathcal{C}, \mathcal{K})$. Our method iterates the following phases until convergence.
The iteration maintains and builds a sequence $\sigma^{(i)}$ of candidate solutions and a
sequence $\mathcal{E}^{(i)}$ of example instances, where $\mathcal{E}^{(i)}$ are ground clauses obtained from
$\mathcal{C}$ by instantiating the term variables and serve as counterexamples to the
candidate solution of the preceding iteration. The iteration starts from an empty
set of example instances.


Synthesis Phase: We check if $(\mathcal{E}^{(i)}, \mathcal{K})$ is unsatisfiable. If so, we stop by
returning $\mathcal{E}^{(i)}$ as a genuine counterexample to the input problem $(\mathcal{C}, \mathcal{K})$.
Otherwise, we use the synthesizer $\mathcal{S}_{TB}$ (cf. Sect. 5.1) to find a solution $\sigma^{(i)}$ of
$(\mathcal{E}^{(i)}, \mathcal{K})$, which will be used as the next candidate solution.


Validation Phase: We check if $\sigma^{(i)}$ is a genuine solution to $(\mathcal{C}, \mathcal{K})$ by using an
SMT solver. If so, we stop by returning $\sigma^{(i)}$ as a solution. Otherwise, for each
clause $c \in \mathcal{C}$ not satisfied by $\sigma^{(i)}$, we obtain a term substitution $\theta_c$ such that
$dom(\theta_c) = ftv(c)$ and $\not\models \theta_c(\sigma^{(i)}(c))$. We then update the example set by adding
a new example instance for each unsatisfied clause (i.e.,
$\mathcal{E}^{(i+1)} = \mathcal{E}^{(i)} \cup \{\theta_c(c) \mid c \in \mathcal{C} \wedge \not\models \sigma^{(i)}(c)\}$), and proceed to the next iteration.

The above procedure satisfies the usual progress property of CEGIS: discovered
counterexamples and candidate solutions are not discovered again in succeeding
iterations. Furthermore, as discussed in Sect. 5.1, by carefully designing
the synthesizer $\mathcal{S}_{TB}$ to incorporate stratified CEGIS, we achieve completeness
in the sense of [34,55]: if the given pfwCSP $(\mathcal{C}, \mathcal{K})$ has a syntactic solution
expressible in the stratified families of templates, a solution of the pfwCSP is
eventually found by the procedure. In Sect. 5.1, we discuss the details of the
synthesis phase. There, for space, we focus on the theory of quantifier-free linear
integer arithmetic (QFLIA). We also defer the details of the unsatisfiability
checking process to the extended report [58].
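
The alternation of synthesis and validation phases described above can be sketched on a toy problem. The following is a minimal illustration only, not the PCSAT implementation: the clause set, the interval template, and all function names are hypothetical, and a bounded enumeration over a small integer domain stands in for the SMT solver.

```python
from itertools import product

# Toy constraint set over one predicate variable inv(x), encoding the loop
#   x := 0; while x < 5: x := x + 1; assert x <= 5
# Each clause is a function (inv, x) -> bool that is True when the ground
# instance obtained by substituting x holds under the candidate inv.
CLAUSES = [
    lambda inv, x: x != 0 or inv(x),                      # x = 0 => inv(x)
    lambda inv, x: not (inv(x) and x < 5) or inv(x + 1),  # inv(x) and x < 5 => inv(x+1)
    lambda inv, x: not inv(x) or x <= 5,                  # inv(x) => x <= 5
]

# ASSUMPTION: a bounded domain stands in for the SMT solver's search.
DOMAIN = range(-20, 21)


def make_inv(lo, hi):
    """Interval template lo <= x <= hi (hypothetical, for illustration)."""
    return lambda x: lo <= x <= hi


def synthesize(examples, bound=10):
    """Synthesis phase: find template parameters satisfying all examples."""
    for lo, hi in product(range(-bound, bound + 1), repeat=2):
        inv = make_inv(lo, hi)
        if all(CLAUSES[c](inv, x) for c, x in examples):
            return lo, hi
    return None  # no candidate in this template fits the examples


def validate(params):
    """Validation phase: one counterexample instance per violated clause."""
    inv = make_inv(*params)
    counterexamples = set()
    for c, clause in enumerate(CLAUSES):
        for x in DOMAIN:
            if not clause(inv, x):
                counterexamples.add((c, x))
                break
    return counterexamples


def cegis(max_iters=100):
    examples = set()  # the example set starts empty
    for iteration in range(max_iters):
        candidate = synthesize(examples)
        if candidate is None:
            return None, iteration
        counterexamples = validate(candidate)
        if not counterexamples:
            return candidate, iteration  # genuine solution (on DOMAIN)
        examples |= counterexamples
    return None, max_iters
```

Each example instance is a pair of a clause index and a value for its term variable, mirroring the ground clauses added per unsatisfied clause. Because every added instance refutes the current candidate, no candidate is proposed twice, which is the progress property mentioned above.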


Remark 5. The implementation described in Sect. 6 contains an additional phase,
called the resolution phase, for accelerating the convergence. There, we first apply
unit propagation repeatedly to the given $\mathcal{E}$ to obtain positive examples $\mathcal{E}^+$
of the form $X(\tilde{v})$ and negative examples $\mathcal{E}^-$ of the form $\neg X(\tilde{v})$. We then
repeatedly apply the resolution principle to the clauses in the input clauses $\mathcal{C}$ and
the clauses $\mathcal{E}^+ \cup \mathcal{E}^-$ to obtain additional positive and negative examples.
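
As an illustration of the first step of this phase, the sketch below applies unit propagation to a set of ground clauses and collects the resulting positive and negative examples. The clause representation (sets of `(atom, polarity)` literals) is an assumption made for this example, and conflict detection is omitted for brevity.

```python
def unit_propagate(clauses):
    """Collect positive/negative examples from ground clauses.

    clauses: iterable of sets of literals (atom, polarity), where atom is any
    hashable value such as ("X", (1, 2)) and polarity is True for a positive
    occurrence. This representation is hypothetical.
    """
    clauses = [set(c) for c in clauses]
    positive, negative = set(), set()
    changed = True
    while changed:
        changed = False
        # Record every unit clause as a positive or negative example.
        for clause in clauses:
            if len(clause) == 1:
                (atom, polarity), = clause
                target = positive if polarity else negative
                if atom not in target:
                    target.add(atom)
                    changed = True
        simplified = []
        for clause in clauses:
            # A clause containing a literal already known true is dropped.
            if any(atom in (positive if pol else negative) for atom, pol in clause):
                continue
            # Literals already known false are deleted from the clause.
            reduced = {(atom, pol) for atom, pol in clause
                       if atom not in (negative if pol else positive)}
            if reduced != clause:
                changed = True
            simplified.append(reduced)
        clauses = simplified
    return positive, negative
```

For instance, from the ground clauses X(1), ¬X(1) ∨ X(2), ¬X(2) ∨ ¬X(3), and X(4) ∨ X(5), propagation yields the positive examples X(1), X(2) and the negative example X(3), while the last clause contributes nothing.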


5.1 Predicate Synthesis with Stratified Families of Templates 


We describe our candidate solution synthesizer $\mathcal{S}_{TB}$. $\mathcal{S}_{TB}$ performs a
template-based search for a solution to the given example instances. As we shall show, our
approach allows searching for assignments to all predicate variables (of all three
kinds) in the given instance, which is important because satisfying assignments to
different predicate variables are often inter-dependent. There is, however, a trade-off
between expressiveness and generalizability. With less expressive templates like
intervals, we may miss actual solutions. But with very expressive templates like
polyhedra, there could be many solutions, and a solution thus returned is liable
to overfitting, i.e., the solution to the example instances becomes too specific
to be an actual solution to the original input clauses. [44] discusses a similar
overfitting issue in the context of grammar-based synthesis.


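Stratified Template Family for Ordinary Predicate Variables:
$T_X^{\bullet}(nd, nc, ac, ad) \triangleq \lambda(x_1, \dots, x_{ar(X)}).\ \bigvee_{i=1}^{nd} \bigwedge_{j=1}^{nc} c_{i,j,0} + \sum_{k=1}^{ar(X)} c_{i,j,k} \cdot x_k \ge 0$
$\phi_X^{\bullet}(nd, nc, ac, ad) \triangleq \bigwedge_{i=1}^{nd} \bigwedge_{j=1}^{nc} \left(\sum_{k=1}^{ar(X)} |c_{i,j,k}| \le ac\right) \wedge |c_{i,j,0}| \le ad$


Stratified Template Family for Well-Founded Predicate Variables:
$T_X^{\Downarrow}(np, nl, nc, rc, rd, dc, dd) \triangleq \lambda(\tilde{x}, \tilde{y}).\ \bigwedge_{i=1}^{np} \bigwedge_{j=1}^{nl} r_{i,j}(\tilde{x}) \ge 0 \wedge \left(\bigvee_{i=1}^{np} D_i(\tilde{x})\right) \wedge \left(\bigvee_{i=1}^{np} D_i(\tilde{y})\right) \wedge \bigvee_{i=1}^{np} \left(D_i(\tilde{x}) \wedge \bigwedge_{j=1}^{np} \left(D_j(\tilde{y}) \Rightarrow DEC_{i,j}(\tilde{x}, \tilde{y})\right)\right)$
$\phi_X^{\Downarrow}(np, nl, nc, rc, rd, dc, dd) \triangleq \bigwedge_{i=1}^{np} \bigwedge_{j=1}^{nl} \left(\sum_{k=1}^{|\tilde{x}|} |c_{i,j,k}| \le rc\right) \wedge |c_{i,j,0}| \le rd \wedge \bigwedge_{i=1}^{np} \bigwedge_{j=1}^{nc} \left(\sum_{k=1}^{|\tilde{x}|} |c'_{i,j,k}| \le dc\right) \wedge |c'_{i,j,0}| \le dd$
$DEC_{i,j}(\tilde{x}, \tilde{y}) \triangleq \bigvee_{k=1}^{nl} \left(r_{i,k}(\tilde{x}) > r_{j,k}(\tilde{y}) \wedge \bigwedge_{l=1}^{k-1} r_{i,l}(\tilde{x}) \ge r_{j,l}(\tilde{y})\right)$
$r_{i,j}(\tilde{x}) \triangleq c_{i,j,0} + \sum_{k=1}^{|\tilde{x}|} c_{i,j,k} \cdot x_k \qquad D_i(\tilde{x}) \triangleq \bigwedge_{j=1}^{nc} c'_{i,j,0} + \sum_{k=1}^{|\tilde{x}|} c'_{i,j,k} \cdot x_k \ge 0$


Stratified Template Family for Functional Predicate Variables:
$T_X^{\lambda}(nd, nc, dc, dd, ec, ed) \triangleq \lambda(\tilde{x}, r).\ r = \mathbf{if}\ D_1(\tilde{x})\ \mathbf{then}\ e_1(\tilde{x})\ \mathbf{else\ if}\ D_2(\tilde{x})\ \mathbf{then}\ e_2(\tilde{x})\ \cdots\ \mathbf{else\ if}\ D_{nd-1}(\tilde{x})\ \mathbf{then}\ e_{nd-1}(\tilde{x})\ \mathbf{else}\ e_{nd}(\tilde{x})$
$\phi_X^{\lambda}(nd, nc, dc, dd, ec, ed) \triangleq \bigwedge_{i=1}^{nd} \left(\sum_{j=1}^{|\tilde{x}|} |c_{i,j}| \le ec\right) \wedge |c_{i,0}| \le ed \wedge \bigwedge_{i=1}^{nd-1} \bigwedge_{j=1}^{nc} \left(\sum_{k=1}^{|\tilde{x}|} |c'_{i,j,k}| \le dc\right) \wedge |c'_{i,j,0}| \le dd$
$e_i(\tilde{x}) \triangleq c_{i,0} + \sum_{j=1}^{|\tilde{x}|} c_{i,j} \cdot x_j \qquad D_i(\tilde{x}) \triangleq \bigwedge_{j=1}^{nc} c'_{i,j,0} + \sum_{k=1}^{|\tilde{x}|} c'_{i,j,k} \cdot x_k \ge 0$


Fig. 2. Stratified families of templates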


Our remedy to the issue is stratified families of predicate templates, inspired 
by a similar approach proposed in the context of predicate abstraction with 
CEGAR [34,55]. Initially, we assign each predicate variable a less expressive 
template and gradually refine it in a counterexample-guided manner: if no
solution exists in the current template, we generate and analyze an unsat core to
identify the parameters of the families of templates that should be updated. The
stratification of templates thus automatically pushes the template to an expres- 
sive one (e.g., polyhedra) when it needs to. Importantly, with our approach, 
expressive templates are not always used but only when they should be used. 


Stratified Families of Templates. We have designed three stratified families
of templates shown in Fig. 2, respectively for ordinary ($\bullet$), well-founded ($\Downarrow$), and
functional ($\lambda$) predicate variables. First, for each ordinary predicate variable $X$,
we prepare the stratified family of templates $T_X^{\bullet}(nd, nc, ac, ad)$ with unknowns
$c_{i,j,k}$’s to be inferred and its accompanying constraint $\phi_X^{\bullet}(nd, nc, ac, ad)$. The
body of $T_X^{\bullet}$ is a DNF with affine inequalities as atoms. The parameter $nd$ (resp.
$nc$) is the number of disjuncts (resp. conjuncts). The parameter $ac$ is the upper
bound of the sum of the absolute values of coefficients $c_{i,j,k}$ ($k > 0$), and $ad$ is
the upper bound of the absolute value of $c_{i,j,0}$.
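
To make the shape of the ordinary template concrete, the following sketch evaluates a DNF of affine inequalities and checks the stratum bounds on its coefficients. The list-based encoding of coefficients and the function names are assumptions made for this illustration.

```python
def affine(coeffs, xs):
    """c0 + c1*x1 + ... + cn*xn, with coeffs = [c0, c1, ..., cn]."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))


def eval_dnf(template, xs):
    """template[i][j] is the coefficient vector of atom j in disjunct i;
    each atom stands for the inequality affine(atom, xs) >= 0."""
    return any(all(affine(atom, xs) >= 0 for atom in disjunct)
               for disjunct in template)


def within_stratum(template, nd, nc, ac, ad):
    """Shape and coefficient bounds imposed by the stratum (nd, nc, ac, ad)."""
    if len(template) != nd or any(len(d) != nc for d in template):
        return False
    return all(sum(abs(c) for c in atom[1:]) <= ac and abs(atom[0]) <= ad
               for disjunct in template for atom in disjunct)


# (x >= 0 and y >= 0) or (x + y <= -3); the second disjunct is padded with
# the trivially true atom 0 >= 0 so that both disjuncts have nc = 2 atoms.
EXAMPLE = [
    [[0, 1, 0], [0, 0, 1]],
    [[-3, -1, -1], [0, 0, 0]],
]
```

Here `EXAMPLE` lies in the stratum with nd = 2, nc = 2, ac = 2, ad = 3 but not in the one with ac = 1, since the atom for x + y ≤ -3 has coefficients whose absolute values sum to 2.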

Secondly, for each well-founded predicate variable $X$, we prepare the stratified
family of templates $T_X^{\Downarrow}(np, nl, nc, rc, rd, dc, dd)$ with unknowns $c_{i,j,k}$’s and
$c'_{i,j,k}$’s and its accompanying constraint $\phi_X^{\Downarrow}(np, nl, nc, rc, rd, dc, dd)$. $T_X^{\Downarrow}$
represents the well-founded relation induced by a piecewise-defined lexicographic affine
ranking function [2,39,39,60,61] where $r_{i,j}$ is the affine ranking function
template for the $j$-th lexicographic component of the $i$-th region specified by the
discriminator $D_i$. The parameter $np$ (resp. $nl$) is the number of regions (resp.
lexicographic components). The parameters $rc$, $rd$, $dc$, $dd$ are the upper bounds
of (the sums of) the absolute values of unknowns, similar to $ac$ and $ad$ of $T_X^{\bullet}$.
The first conjunct of $T_X^{\Downarrow}$ asserts that the return value of each ranking function
is non-negative. The second and the third conjuncts assert that the discriminators
cover all relevant states. Note that discriminators may overlap, and for
such overlapping regions, the maximum return value of the ranking functions is
used. The fourth conjunct asserts that the return value of the piecewise-defined
ranking function strictly decreases from $\tilde{x}$ to $\tilde{y}$. Here, $DEC_{i,j}(\tilde{x}, \tilde{y})$ asserts that
the return value of the lexicographic ranking function for the $i$-th region at $\tilde{x}$
is greater than that for the $j$-th region at $\tilde{y}$. It follows that for any substitution
$\theta$ for the unknowns in $T_X^{\Downarrow}$, $\theta(T_X^{\Downarrow})$ represents a well-founded relation. Our
implementation PCSAT uses a refined version of $T_X^{\Downarrow}$ shown in the extended
report [58].
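
The four conjuncts described above can be read as an executable check. The sketch below is one possible reading of the template, not the refined version used by PCSAT; the region encoding (a list of discriminator atoms paired with a list of ranking components) and all names are hypothetical.

```python
def affine(coeffs, xs):
    """c0 + c1*x1 + ... + cn*xn, with coeffs = [c0, c1, ..., cn]."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))


def holds(atoms, xs):
    """Discriminator: conjunction of affine atoms, each required to be >= 0."""
    return all(affine(a, xs) >= 0 for a in atoms)


def dec(rank_i, rank_j, xs, ys):
    """Lexicographic decrease from region i at xs to region j at ys: some
    component strictly decreases and all earlier ones do not increase."""
    return any(
        affine(rank_i[k], xs) > affine(rank_j[k], ys)
        and all(affine(rank_i[l], xs) >= affine(rank_j[l], ys) for l in range(k))
        for k in range(len(rank_i))
    )


def wf_relation(regions, xs, ys):
    """regions: list of (discriminator_atoms, ranking_components)."""
    # Conjunct 1: every ranking component is non-negative at xs.
    nonneg = all(affine(r, xs) >= 0 for _, ranks in regions for r in ranks)
    # Conjuncts 2 and 3: the discriminators cover both xs and ys.
    covered = (any(holds(d, xs) for d, _ in regions)
               and any(holds(d, ys) for d, _ in regions))
    # Conjunct 4: some region containing xs dominates every region containing ys.
    decreases = any(
        holds(d_i, xs)
        and all(not holds(d_j, ys) or dec(r_i, r_j, xs, ys) for d_j, r_j in regions)
        for d_i, r_i in regions
    )
    return nonneg and covered and decreases


# One region (empty discriminator, i.e. always true) with the
# two-component lexicographic ranking function (x1, x2).
REGIONS = [([], [[0, 1, 0], [0, 0, 1]])]
```

With `REGIONS`, the pair (1, 0) → (0, 100) is related because the first component decreases, and (1, 5) → (1, 4) is related because the second decreases while the first is unchanged; (1, 4) → (1, 5) is not related.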

Finally, for each functional predicate variable $X$, we prepare the stratified
family of templates $T_X^{\lambda}(nd, nc, dc, dd, ec, ed)$ with unknowns $c_{i,j}$’s and $c'_{i,j,k}$’s
and its accompanying constraint $\phi_X^{\lambda}(nd, nc, dc, dd, ec, ed)$. $T_X^{\lambda}$ characterizes a
piecewise-defined affine function with discriminators $D_1, \dots, D_{nd-1}$ and branch
expressions $e_1, \dots, e_{nd}$. The parameter $nc$ is the number of conjuncts in each
discriminator. The parameters $dc$, $dd$, $ec$, $ed$ are the upper bounds of (the sums
of) the absolute values of unknowns, similar to $ac$ and $ad$ of $T_X^{\bullet}$. Note that for
any substitution $\theta$ for the unknowns in $T_X^{\lambda}$, $\theta(T_X^{\lambda})(\tilde{x}, r)$ expresses a total function
that maps $\tilde{x}$ to $r$.
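
A small sketch may help to see why any instantiation of this template denotes a total function: the if-then-else chain always falls through to a final branch. The encoding of branches and the absolute-value example are assumptions made for illustration.

```python
def affine(coeffs, xs):
    """c0 + c1*x1 + ... + cn*xn, with coeffs = [c0, c1, ..., cn]."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))


def eval_piecewise(branches, default, xs):
    """branches: list of (discriminator_atoms, branch_expression), tried in
    order; default is the expression of the final else branch."""
    for atoms, expr in branches:
        if all(affine(a, xs) >= 0 for a in atoms):
            return affine(expr, xs)
    return affine(default, xs)


def functional_predicate(branches, default, xs, r):
    """The predicate holds exactly when r is the function's value at xs."""
    return r == eval_piecewise(branches, default, xs)


# |x| as a piecewise affine function: if x >= 0 then x else -x
# (nd = 2 branches, nc = 1 atom per discriminator).
ABS_BRANCHES = [([[0, 1]], [0, 1])]
ABS_DEFAULT = [0, -1]
```

For every input there is exactly one `r` satisfying `functional_predicate`, which is the totality and functionality property that the template guarantees by construction.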

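Next, we give the details of the candidate solution synthesis process. Let
$\vec{p} \in \mathbb{Z}^n$ where $n$ is the number of parameters summed across all templates, and
let $T_X^{\alpha}(\vec{p})$ and $\phi_X^{\alpha}(\vec{p})$ (for $\alpha \in \{\bullet, \Downarrow, \lambda\}$) project the corresponding parameters.
Each $\vec{p} \in \mathbb{Z}^n$ induces a solution space $[\![\vec{p}]\!] = \{T(\vec{p})[\theta] \mid \theta \models Con(\vec{p})\}$ where


$T(\vec{p})[\theta] \triangleq \{X \mapsto \theta(T_X^{\mathcal{K}(X)}(\vec{p})) \mid X \in fpv(\mathcal{C})\}$ and $Con(\vec{p}) \triangleq \bigwedge_{X \in fpv(\mathcal{C})} \phi_X^{\mathcal{K}(X)}(\vec{p})$.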

Let $\vec{p}_1 \le \vec{p}_2$ be the point-wise ordering. Note that $[\![\vec{p}]\!]$ is a finite set for
any $\vec{p} \in \mathbb{Z}^n$, and $\vec{p}_1 \le \vec{p}_2$ implies $[\![\vec{p}_1]\!] \subseteq [\![\vec{p}_2]\!]$. We start the CEGIS process
with some small initial parameters $\vec{p}^{(0)}$ (the parameters will be maintained as
a state of the CEGIS process). The synthesis phase of each iteration tries to
find a solution $\sigma \in [\![\vec{p}^{(i)}]\!]$ to the given example instances $(\mathcal{E}, \mathcal{K})$ where $\vec{p}^{(i)}$ are
the current parameters. This is done by using an SMT solver for QFLIA to
find $\theta$ satisfying $\bigwedge T(\vec{p}^{(i)})[\theta](\mathcal{E}) \wedge \theta(Con(\vec{p}^{(i)}))$. If such $\theta$ is found, we return
$T(\vec{p}^{(i)})[\theta]$ as the candidate solution for the next validation phase of the CEGIS
process. Note that, by construction of the templates, the solution is guaranteed
to assign each well-founded (resp. functional) predicate variable a well-founded
relation (resp. total function). Otherwise, no solutions to the given example
instances $(\mathcal{E}, \mathcal{K})$ can be found in $[\![\vec{p}^{(i)}]\!]$, and we update the parameters to some
$\vec{p}^{(i+1)} > \vec{p}^{(i)}$ such that $[\![\vec{p}^{(i+1)}]\!]$ contains a solution for $(\mathcal{E}, \mathcal{K})$. Here, it is important
to do the update in a fair manner [34,55], that is, in any infinite series of updates
$\vec{p}^{(0)}, \vec{p}^{(1)}, \dots$, every parameter is updated infinitely often (the details are deferred
to below). By the progress property and the fact that every $[\![\vec{p}]\!]$ is finite, this
ensures that every parameter is updated infinitely often in an infinite series of
CEGIS iterations. We thus obtain the following property.


Theorem 5. Our CEGIS-procedure based on stratified families of templates is
complete in the sense of [34,55]: if there are $\vec{p}$ and $\sigma \in [\![\vec{p}]\!]$ such that $\sigma$ is a
syntactic solution to the given pfwCSP $(\mathcal{C}, \mathcal{K})$, a syntactic solution to $(\mathcal{C}, \mathcal{K})$ is
eventually found by the procedure.


Note that, while the solution space of each stratum (i.e., $[\![\vec{p}]\!]$) is finite, our
procedure searches the infinite solution space obtained by taking the infinite
union of the solution spaces of the template family strata (i.e., $\bigcup_{\vec{p} \in \mathbb{Z}^n} [\![\vec{p}]\!]$).
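
The idea of searching finite strata of growing size can be sketched as follows. This is a simplified illustration under stated assumptions: the template is a union of `nd` integer intervals whose endpoints are bounded by `ad`, the examples are plain positive and negative points, brute-force enumeration replaces the SMT query, and the alternating growth schedule is a hypothetical stand-in for the unsat-core-guided update.

```python
from itertools import product


def covers(intervals, x):
    return any(lo <= x <= hi for lo, hi in intervals)


def solve_in_stratum(nd, ad, pos, neg):
    """Enumerate the finite stratum: nd intervals with |lo|, |hi| <= ad."""
    bounds = range(-ad, ad + 1)
    for flat in product(bounds, repeat=2 * nd):
        intervals = [(flat[2 * i], flat[2 * i + 1]) for i in range(nd)]
        if (all(covers(intervals, x) for x in pos)
                and not any(covers(intervals, x) for x in neg)):
            return intervals
    return None


def stratified_search(pos, neg, max_updates=4):
    """Start from the least expressive stratum and grow it only on failure."""
    nd, ad = 1, 1
    for step in range(max_updates + 1):
        solution = solve_in_stratum(nd, ad, pos, neg)
        if solution is not None:
            return (nd, ad), solution
        # Hypothetical fair schedule: alternate between the two parameters.
        if step % 2 == 0:
            nd += 1
        else:
            ad += 1
    return None
```

With positive points {-1, 2} and negative point {0}, the strata (1, 1) and (2, 1) contain no solution, and the search succeeds in stratum (2, 2). Each stratum is finite and included in the next one, matching the monotonicity of the solution spaces noted above.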


Remark 6. Our template-based synthesis simultaneously finds ordinary, well- 
founded, and functional predicates that are mutually dependent through the 
given (E, K). This means that templates for different kinds of predicate variables 
are updated in a synchronized and balanced manner, which benefits the synthesis 
of mutually dependent witnesses for a relational property (see the extended 
report [58] for examples). 


Updating Parameters of Template Families via Unsat Cores. We now describe
the parameter update process. We first obtain the unsat core of the
unsatisfiability of $\bigwedge T(\vec{p}^{(i)})[\theta](\mathcal{E}) \wedge \theta(Con(\vec{p}^{(i)}))$ from the SMT solver. We then analyze
the core to obtain the parameters of template families, such as the number of
conjuncts and disjuncts, that have caused the unsatisfiability. Here, there could
be a dependency between predicate variables, and in such a case our unsat core
analysis enumerates all the involved predicate variables, from which we obtain
the parameters of template families to be updated. We then increment these
parameters in some fair manner, by limiting the maximum differences between
different parameters to some bounded threshold, and repeatedly solve the
resulting constraint until a solution is found. Thus, the parameters of stratified families
of templates are grown on-the-fly, guided by the reasons for unsatisfiability. We
found that a careful design of parameter update strategies is important for scaling
the stratified CEGIS to hard relational verification problems. The manual
tuning, however, is tiresome and suboptimal. We leave the investigation of
methods for automatic tuning of parameter update strategies as future work.
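
One way to realize a fair increment with a bounded difference between parameters is sketched below. This is a hypothetical strategy written for illustration, not the strategy implemented in PCSAT; in particular, the set of blamed parameters is assumed to be given (in the method above it would come from the unsat core analysis), and the threshold value is arbitrary.

```python
def update_parameters(params, blamed, threshold=2):
    """Increment the blamed parameters, then lift any parameter that lags
    more than `threshold` behind the largest one.

    params:  dict mapping parameter names (e.g. "nd", "ac") to integers.
    blamed:  names of the parameters identified as causing unsatisfiability.
    """
    updated = dict(params)
    for name in blamed:
        updated[name] += 1
    top = max(updated.values())
    for name in updated:
        if top - updated[name] > threshold:
            updated[name] = top - threshold
    return updated
```

Since every update increases at least one parameter and no parameter may lag more than the threshold behind the maximum, every parameter grows without bound in an infinite series of updates, which is the fairness condition required for completeness.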


6 Evaluation 


To evaluate the presented verification framework, we have implemented PCSAT,
a satisfiability checking tool for pfwCSP based on stratified CEGIS. PCSAT
supports the theory of Booleans and the quantifier-free theory of linear inequalities
over integers and rationals. The tool is implemented in OCaml, using Z3 [41] as
the backend SMT solver. We ran the tool on a diverse collection of 20 relational
verification problems, consisting of k-safety, co-termination, and GNI problems.
Though we have manually reduced them to pfwCSP using the method presented
in Sect. 4, this process can be easily automated. The full benchmark set is
provided in the extended report [58]. All experiments have been conducted on a
3.1 GHz Intel Xeon Platinum 8000 CPU with 32 GB RAM and a time limit of
600 s.

The experimental results are summarized in Table 1. The columns “Time (s)”
and “#Iters” respectively show elapsed wall clock time in seconds and numbers
of CEGIS iterations. PCSAT solved 15 verification problems fully automatically
and 5 problems labeled with the symbol † and/or ‡ semi-automatically. For the 4
problems labeled with †, we manually provided small hints for invariant synthesis
(interested readers are referred to [58]). The provided hints for all but SquareSum
are non-relational invariants that can be inferred prior to relational verification
by using a CHCs solver or an invariant synthesizer. For the 2 problems labeled
with ‡, we manually chose the initial value for the parameters of the template
family for ordinary predicate variables to reduce the elapsed time. This can be
automated by running PCSAT with different initial values in parallel.

The problems DoubleSquareNI_h**, HalfSquareNI, ArrayInsert, and
SquareSum are k-safety verification problems obtained from [50] that require
non-lock-step scheduling.³ The problems DoubleSquareNI_h** are generated from
Example 1 by a case analysis of the valuation for the boolean variables $h_1$ and $h_2$.
PCSAT solved all the k-safety problems but SquareSum fully automatically. The
tool PDSC proposed in [50] can verify them but requires the user to provide the
atomic predicates for expressing relational invariants and schedulers. The
problems CotermIntro1 and CotermIntro2 are asymmetric co-termination problems
obtained from the symmetric problem in Example 2 and are verified by PCSAT
fully automatically. The problems TS_GNI_h** are generated from Example 3 by


3 We omitted ArrayIntMod from [50] because its verification requires the theory of 
arrays which the current version of PCSAT does not fully support.

a case analysis and are verified by PCSAT with small non-relational hints. We
have also tested PCSAT on various TS-GNI (SimpleTS_GNI1, SimpleTS_GNI2,
InfBranchTS_GNI) and TI-GNI problems (TI_GNI_h**) and obtained
promising results. As far as we know, no existing tools can verify these non-k-safety
relational problems.

Furthermore, manual inspection of PCSAT’s output logs for the GNI
problems that required hints revealed that the functional predicate synthesis 
appears to be the main bottleneck of the current version. In fact, we confirmed 
that the problems can be solved in less than 10s if appropriate functional pred- 
icates for angelic non-determinism are manually provided. As future work, we 
plan to investigate methods for improved functional predicate synthesis. 


7 Related Work 


7.1 Relational Verification 


There has been substantial work on verifying relational properties. These include
program logics, type systems, or program analysis frameworks such as abstract
interpretation and model checking [1,5,9,19,25,52,62], program transformation
approaches such as self-composition or product programs [4,7,15,20,21,42,47,
54,57,64], and various other approaches [3,18,23,46,59]. We refer to [43] for an
excellent survey. However, most prior automatic approaches address only the
k-safety fragment [17,54] and cannot verify non-k-safety (actually, not even
hypersafety) properties such as co-termination, TS-NI, TI-GNI, and TS-GNI [6,11,40].
The only exception that we are aware of is the recent work by Coenen et al. [19], which
proposes a sound method for automatically verifying $\forall\exists$ hyperproperties such
as GNI for finite state systems. To our knowledge, we are the first to propose a
sound-and-complete approach to automatically verifying these non-hypersafety
properties for infinite state programs.⁴

A key task in many relational verification methods, including ours, is the 
discovery of relational invariants which relate the states of multiple program 
executions. While most prior approaches are limited to a fixed execution schedule
(or alignment) such as lock-step and sequential [7,8,20,21,42,54,57], a recent 
work by Shemer et al. [50] has proposed a k-safety property verification method 
that automatically infers fair schedulers sufficient to prove the goal property. 
Importantly, the schedulers in their approach can be semantic in which the 
choice of which program to execute can depend on the states of the programs 
as opposed to the classic syntactic schedulers such as lock-step and sequential 
that can only depend on the control locations. Our approach also infers such fair 
semantic schedulers, and as remarked before, they enable solving instances like 
doubleSquare that are difficult for previous approaches. However, [50] requires 


4 However, [19] can verify (relational) temporal properties, whereas we only support 
functional properties that are given by pre and post conditions of whole program 
runs. We leave as future work to investigate methods for verifying relational temporal 
properties of infinite state programs.

Table 1. Experimental results of PCSAT on the relational verification benchmarks


Program             | Time (s) | #Iters | Program          | Time (s) | #Iters
DoubleSquareNI_hFT  | 17.762   | 42     | HalfSquareNI     | 11.853   | 35
DoubleSquareNI_hTF  | 26.495   | 55     | ArrayInsert‡     | 118.671  | 73
DoubleSquareNI_hFF  | 2.944    | 9      | SquareSum†‡      | 337.596  | 117
DoubleSquareNI_hTT  | 4.055    | 11     | SimpleTS_GNI1    | 5.397    | 14
CotermIntro1        | 19.322   | 80     | SimpleTS_GNI2    | 8.919    | 26
CotermIntro2        | 15.871   | 73     | InfBranchTS_GNI  | 2.607    | 4
TS_GNI_hFT†         | 47.083   | 78     | TI_GNI_hFT†      | 4.389    | 16
TS_GNI_hTF          | 5.076    | 17     | TI_GNI_hTF       | 22T      |
TS_GNI_hFF          | 7.174    | 24     | TI_GNI_hFF       | 2.968    |
TS_GNI_hTT†         | 23.495   | 53     | TI_GNI_hTT       | 4.148    | 22


the user to provide appropriate atomic predicates and is not fully automatic. 
By contrast, our approach soundly and completely encodes the problem as a 
constraint satisfaction problem and fully automatically verifies hard instances 
like doubleSquare by constraint solving. 

Furthermore, our work extends the fair semantic scheduler synthesis
beyond k-safety to problems like co-termination, TI-GNI and TS-GNI, in a sound
and complete manner. We note that the extensions are non-trivial and involve
delicate uses of functional predicate variables and well-founded predicate
variables to ensure scheduler fairness in the presence of non-termination as well as
uses of prophecy variables and functional predicate variables to handle angelic
non-determinism. The higher degree of automation and the extension to
non-k-safety properties are thanks to the expressive power of our novel constraint
framework pfwCSP.


7.2 Predicate Constraint Solving 


Our pfwCSP solving technique builds on and generalizes a number of techniques 
developed for CHCs solving as well as invariant and ranking function discovery. 
Most closely related to our constraint solving method are CEGIS-based [51] and 
data-driven approaches to solving CHCs [14,22,24,26,27,38,44,45,48,49,65]. As
remarked before, the new pfwCSP framework is strictly more expressive than 
CHCs and extending the prior techniques to the new framework is non-trivial. 

Our stratified CEGIS is inspired by the idea of stratified languages of predi- 
cates proposed in the context of predicate abstraction with CEGAR [34,55]. It 
is also similar in spirit to the work by Padhi et al. [44], but they use a stratified 
family of grammars. Also, none of these prior works use unsat cores for updating
the language/grammar stratum, synthesize well-founded relations and functional
predicates, or support non-Horn clauses.

Our class of pfwCSP constraints is related to existentially-quantified Horn
clauses (E-CHCs) introduced by Beyene et al. [12]. E-CHCs do not have
non-Horn clauses or functional predicate variables. However, they have
disjunctive well-foundedness constraints which are similar to our well-founded predicate
variables. Also, existential quantifiers can be used to encode head disjunctions
and functional predicates. We conjecture that pfwCSP and E-CHCs are
inter-reducible, but it is not trivial to fill the gap. Also, inter-reducibility is a desirable
feature: different formats may have different benefits. For relational verification,
as we have shown, pfwCSP enables direct sound-and-complete encodings of the
problems. For instance, head disjunctions allow direct encoding of scheduler
fairness and finitary angelic non-determinism (cf. Remark 3). Moreover, functional
predicate variables can be explicitly given necessary-and-sufficient parameters
to encode angelic non-determinism and difference bounds for ensuring scheduler
fairness in the presence of non-termination. The tight encodings also lead to a
reduction in the search space and benefit the constraint solving.


8 Conclusion 


We have introduced the class pfwCSP of predicate constraint satisfaction prob- 
lems that generalizes CHCs with arbitrary clauses, well-foundedness constraints, 
and functionality constraints. We have then established a program verification 
framework based on pfwCSP by showing that (1) pfwCSP can soundly-and- 
completely encode various classes of relational problems of infinite-state non- 
deterministic programs, including hard instances of k-safety, co-termination, 
and termination-sensitive generalized non-interference that benefit from state- 
dependent scheduling/alignment (Theorems 1-4), and (2) existing CHCs solving 
and invariants/ranking function synthesis techniques can be adapted to pfwCSP
solving and further improved with the idea of stratified CEGIS for simultane- 
ously achieving completeness (Theorem 5) and practical effectiveness. 

In future work we plan to investigate ways to improve functional predicate 
synthesis, automatic tuning of parameter update strategies for constraint solving, 
and whether a constraint-based approach (and the techniques presented in the 
present paper) can be extended to reason about relational temporal properties 
such as the ones expressed in hyper temporal logics [16,25].


Abstract. Over the past ten years, the adoption of cloud services has
grown rapidly, leading to the introduction of automated deployment tools 
to address the scale and complexity of the infrastructure companies and 
users deploy. Without the aid of automation, ensuring the security of 
an ever-increasing number of deployments becomes more and more chal- 
lenging. To the best of our knowledge, no formal automated technique 
currently exists to verify cloud deployments during the design phase. In 
this case study, we show that Description Logic modeling and inference 
capabilities can be used to improve the safety of cloud configurations. 
We focus on the Amazon Web Services (AWS) proprietary declarative 
language, CloudFormation, and develop a tool to encode template files 
into logic. We query the resulting models with properties related to secu- 
rity posture and report on our findings. By extending the models with 
dataflow-specific knowledge, we use more comprehensive semantic rea- 
soning to further support security reviews. When applying the developed 
toolchain to publicly available deployment files, we find numerous vio- 
lations of widely-recognized security best practices, which suggests that 
streamlining the methodologies developed for this case study would be 
beneficial. 


1 Introduction 


The term Infrastructure as Code (IaC) refers to the practice of configuring, 
provisioning, and updating systems resources from source code files, which are 
compiled into atomic instructions and then executed to deploy the desired archi- 
tecture [29]. The advantage of handling code, instead of manually provisioning 
resources, lies in the capability to use version control systems, orchestration 
frameworks, and automated testing tools as part of the deployment process. 
In addition to instructions relevant for resource creation, dependencies, and 
updates, IaC configuration files contain information about settings, dataflow, 
and access control. In a time when cloud companies provide customers with 
simple-to-launch, albeit extremely powerful infrastructure, it is crucial to auto- 
matically and provably verify the security of such systems. 

In this study, we investigate IaC deployment frameworks and how these 
are formally modeled and reasoned upon. We explore the usage of description
logics (DLs) as a conceptual-modeling formalism that is expressive, decidable,
and equipped with mature tooling. We argue that formal reasoning techniques 
applied to deployment templates are an immensely valuable tool for developers 
and security engineers by substantially aiding the automation of time-consuming 
security reviews; helping them to detect complex logical errors at earlier stages; 
and, containing the costs that finding and fixing security issues at later stages 
would cause. As the prevalence of cloud infrastructure increases, in addition to 
experts, automated reasoning tools could benefit inexperienced users as well. 


System Studied. We focus on the Amazon Web Services proprietary IaC tool, 
CloudFormation, the first to be introduced at a large scale, over ten years ago. 
AWS, the cloud provider within Amazon, serves millions of customers worldwide.
These include private businesses as well as government, education, nonprofit, and 
healthcare organizations. While the cloud provider is responsible for the faithful 
deployment of the customers’ desired configurations, it is the customer’s duty 
to make sure that these comply with the security requirements of their business 
context. Few management tools of this scale exist. Notable mentions are Ter- 
raform [37], Microsoft Azure’s Resource Manager [28], Google Cloud’s Deploy- 
ment Manager [19], and the recently introduced OASIS standard TOSCA [6]. 


Goal of Study. Our goal is to improve the quality of the security analyses that 
are performed over IaC configurations pre-deployment; and by doing so, their 
overall security. With this study, we investigate the application of description 
logics to the formalization and reasoning over IaC deployments. In particular, we 
are interested in three aspects: (i) whether proposed cloud configurations comply 
with security best practices, (ii) how to aid customers in building more secure 
infrastructure before deploying it, and (iii) to what extent formal automated 
techniques can support manual pre-deployment security reviews. 


Challenges. Little research has been done so far on the possibility to formalize 
IaC languages, and no research has been done to devise a logic that is well-suited 
to reason about cloud infrastructure. By nature, cloud infrastructure interacts 
with an open environment that is, at best, only partially known. In particu- 
lar, external-facing APIs and users participate in these interactions. By design, 
cloud services allow for the composition of smaller components into large infras- 
tructure, the complexity of which creates a challenge with respect to security. 
Our models should capture the connectivity of resources, the flow of information 
that spans across multiple paths, and the rich security-related data available 
in IaC configuration files. This is further complicated by the need for a query 
language for verification and falsification, able to express that mitigations must 
be present (vs. may be absent), and security issues must be absent (vs. may be 
present). Importantly, we need practical tools that support the implementation 
of all these parts and that can scale to real-world IaC configurations. 


Our Contribution. We provide a framework to encode IaC into description 
logic, and investigate its effectiveness in answering configuration queries and 
reasoning about dataflow, trust boundaries, and potential issues within the
system. Specifically, we test DLs’ reasoning capabilities to infer new facts about
underspecified resources (such as those not included in a given deployment but
used by it) and leverage DLs’ open-world assumption to perform verification and
refutation, depending on the property being checked. We formalize additional 
security knowledge that allows for checking system-level semantic properties; i.e., 
properties that consider the nature of the cloud environment and more complex 
reachability over an inferred graph representation of the infrastructure. 

Throughout the study, we make four novel contributions: (i) the formaliza- 
tion and logical encoding of AWS CloudFormation (Sect.3); (ii) a technique to 
express security properties (Sect. 4); (iii) the experimental evaluation of encod- 
ing and query times, accounting for the most common security issues that we 
found over publicly available IaC templates (Sect. 5); and (iv) an extension that 
enables semantic dataflow reasoning (Sect. 6). Our tool is implemented in Scala 
and available online [14]. We include preliminaries in Sect. 2; discuss related work 
in Sect. 7; and conclude in Sect. 8. 


2 Preliminaries 


Description Logics. DLs are a family of logics well suited to model relation- 
ships between entities. They provide the logical foundation of the well-known 
Web Ontology Language [20,23,32], for which extensive tool support exists (e.g., 
the Protégé editor and off-the-shelf reasoners such as FaCT, HermiT, and Pel- 
let [18,30,36,39]). We introduce the description logic ALC [1,24,34], Attributive 
Logic with Complement, and two additional features that are relevant for our 
study. ALC formulae are built from symbols from the alphabets $N_C$, of atomic
concept names; $N_R$, of role names; and $N_I$, of individual names. These are
the DL equivalents of FOL unary predicates, binary predicates, and constants, 
respectively. ALC concept expressions are built according to the grammar: 


$$C, D ::= \bot \mid \top \mid A \mid \neg C \mid C \sqcup D \mid C \sqcap D \mid \exists r.C \mid \forall r.C$$


where $A$ is an atomic concept from the set $N_C$; $C$, $D$ are possibly complex
concepts; and $r$ is a role from the alphabet $N_R$. Terminological knowledge is
represented via general concept inclusion axioms $C \sqsubseteq D$. As an example, in the
remainder of this paper we will refer to two standard axioms that enforce the domain
and range of binary relations: $\mathrm{dom}(r, C) = \exists r.\top \sqsubseteq C$ and
$\mathrm{ran}(r, C) = \exists r^{-}.\top \sqsubseteq C$.
Assertional knowledge is represented via concept assertions $C(a)$ and role
assertions $r(a, b)$. In this paper, we will use three additional operators: inverse roles,
functionality constraints, and complex role inclusions. The first, denoted $r^{-}$,
encodes the converse of the binary relationship $r$. The second enforces binary
relationships to be functional. The third, written $r \circ s \sqsubseteq t$, establishes that the
chaining of the two relationships $r$ and $s$ implies the relationship $t$, and can
be used to implement transitivity (when $r = s = t$). A model of a DL knowledge
base is an interpretation $\mathcal{I}$, over a domain $\Delta$, that satisfies all the axioms
and assertions contained in and implied by the knowledge base. For the purpose of
our application, we leverage two classical inference problems: satisfiability and 
instance retrieval, whose full definitions are found in standard textbooks [2,3]. 
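
To make the semantics of these constructors concrete, the following Python sketch evaluates concept expressions over a small, hand-written finite interpretation and checks one domain and one range axiom against it. The tuple encoding of concepts, the helper names (`role_pairs`, `ext`, `satisfies`, `dom`, `ran`), and the toy interpretation are assumptions made for this illustration only. Checking a single fixed interpretation is much weaker than what a DL reasoner does, since a reasoner considers all models of a knowledge base.

```python
# Concepts are nested tuples; a role is a name or ("inv", name) for its inverse.
TOP, BOT = "TOP", "BOT"

def role_pairs(role, roles):
    """Pairs (a, b) related by a role name or by the inverse of a role name."""
    if isinstance(role, tuple) and role[0] == "inv":
        return {(b, a) for (a, b) in roles.get(role[1], set())}
    return set(roles.get(role, set()))

def ext(concept, domain, concepts, roles):
    """Extension of an ALC concept expression in a finite interpretation."""
    if concept == TOP:
        return set(domain)
    if concept == BOT:
        return set()
    if isinstance(concept, str):                      # atomic concept A
        return set(concepts.get(concept, set()))
    op = concept[0]
    if op == "not":
        return set(domain) - ext(concept[1], domain, concepts, roles)
    if op == "or":
        return ext(concept[1], domain, concepts, roles) | ext(concept[2], domain, concepts, roles)
    if op == "and":
        return ext(concept[1], domain, concepts, roles) & ext(concept[2], domain, concepts, roles)
    if op in ("some", "all"):
        pairs = role_pairs(concept[1], roles)
        fillers = ext(concept[2], domain, concepts, roles)
        if op == "some":                              # exists r.C
            return {a for (a, b) in pairs if b in fillers}
        # forall r.C: every r-successor must be in C (vacuously true without successors)
        return {a for a in domain if all(b in fillers for (x, b) in pairs if x == a)}
    raise ValueError(f"unknown constructor: {op!r}")

def satisfies(sub, sup, domain, concepts, roles):
    """True if this interpretation satisfies the inclusion sub ⊑ sup."""
    return ext(sub, domain, concepts, roles) <= ext(sup, domain, concepts, roles)

def dom(role, concept):
    """dom(r, C) as the pair (exists r.TOP, C)."""
    return (("some", role, TOP), concept)

def ran(role, concept):
    """ran(r, C) as the pair (exists r^-.TOP, C)."""
    return (("some", ("inv", role), TOP), concept)

# Toy interpretation (hypothetical individuals b1, lc1, q1).
domain = {"b1", "lc1", "q1"}
concepts = {"BUCKET": {"b1"}, "LOGCONFIG": {"lc1"}}
roles = {"loggingConfig": {("b1", "lc1")}, "destinationBucket": {("lc1", "b1")}}

dom_ok = satisfies(*dom("destinationBucket", "LOGCONFIG"), domain, concepts, roles)
ran_ok = satisfies(*ran("destinationBucket", "BUCKET"), domain, concepts, roles)
```

In this interpretation both axioms hold, whereas a range axiom such as `ran("loggingConfig", "BUCKET")` does not, because the only `loggingConfig` successor is `lc1`.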

AWS CloudFormation. AWS CloudFormation, cfn, provides users with a
declarative programming language and a framework to provision and manage 
over 500 resources spread across 70 services [15].1 Services are products such as 
storage, databases, and processors, and their interface is implemented through 
resources, which are the actual modules that users declare and deploy. Their 
declaration is done by writing one or more so-called CloudFormation Templates 
(JSON-formatted configuration files). Within a template, users configure settings 
and communication of the desired resource instances. As an example, let us 
consider one of the most widely known storage products within AWS: the Simple 
Storage Service S3 (also illustrated in Listings 1.1 and 1.2). The CloudFormation 
interface for S3 consists of two resources: S3::Bucket and S3::BucketPolicy. A
Bucket is a single unit of storage whose properties include encryption, replication, 
and logging settings, which can be viewed as the bucket’s own configuration 
parameters. They could also be references to other resources that are connected 
to the current one, e.g., the unique ID of another bucket where logs are stored. 
A BucketPolicy is a resource that links an access control policy to a bucket. All 
the properties that can be instantiated and the structure of resource-types such
as S3::Bucket and S3::BucketPolicy are given in the CloudFormation Resource
Specification [15]. The resource specification is a collection of files that prescribe 
resource properties and their allowed values. Provided that a configuration file is 
valid with respect to the specifications, an IaC deployment environment compiles 
it into instructions that are then executed to provision the requested resources 
in the correct dependency order and with the desired settings. 


3 Formalization and Encoding of IaC Deployments 


While setting up this case study, we found it convenient to come up with a
formalization, of both IaC resource specifications and IaC configuration files,
to use as an intermediate representation during the encoding process. This was
also needed since we could not find suitable research in the area (although some
preliminary research on IaC formalization does exist: e.g., the PhD thesis
in [12]). As mentioned in Sect. 2, users consult the resource specifications to
find out what fields and values are allowed when declaring a resource.
Intuitively, these provide a sort of type system, or JSON schema, against which
configuration files must validate. Configuration files contain the resource
declarations of the instances that the user wishes to deploy. Let us illustrate
this with some examples.

    "ResourceType":
      "S3::Bucket": {
        "Properties": {
          "BucketName": "String",
          "LoggingConfiguration": {
            "Type": "LoggingConfiguration",
            "Required": false }}},
    "PropertyTypes": ...,
      "S3::Bucket.LoggingConfiguration": {
        "Properties": {
          "DestinationBucketName": {
            "Type": "String",
            "Required": false },
          "LogFilePrefix": {
            "Type": "String",
            "Required": false }}}

Listing 1.1. S3::Bucket specification

1 As of August 2020, exact number is Region-dependent.

Listing 1.1 shows a snippet of the S3::Bucket resource-type specification. In
addition to the main resource type, the specification includes definitions for
its subproperties, their
types, and whether these are required. Although the example only shows string 
properties, in general, allowed property values range over objects, arrays, and
primitive types such as integers, doubles, longs, strings, and booleans. Listing 
1.2, on the other hand, shows a common usage scenario of the S3 storage service, 
where a bucket with basic configuration is used to store the desired data. The 
instance has logical ID ConfigS3Bucket, is of type S3::Bucket, and specifies two 
top-level properties, BucketName and LoggingConfiguration. It is easy to see 
that this instance declaration validates against the resource specification of List- 
ing 1.1. This snippet is taken from one of the benchmark deployments evaluated 
in Sect. 5 (StackSet 15) and, incidentally, it violates a security best practice: “no 
bucket should store its own logs.” This formalization has been instrumental in
capturing infrastructure configurations, resource settings, and interconnections,
and in encoding them into DL precisely and automatically.
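
The validation step mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's tool: the dictionary `SPEC` is a hand-simplified transcription of Listing 1.1, `INSTANCE` transcribes Listing 1.2, only the `String` primitive is handled, and the function name `validate` and its error messages are hypothetical.

```python
# Hand-simplified transcription of Listing 1.1 (assumption: a missing "Required" means false).
SPEC = {
    "S3::Bucket": {
        "BucketName": {"Type": "String"},
        "LoggingConfiguration": {"Type": "LoggingConfiguration", "Required": False},
    },
    "S3::Bucket.LoggingConfiguration": {
        "DestinationBucketName": {"Type": "String", "Required": False},
        "LogFilePrefix": {"Type": "String", "Required": False},
    },
}

# Transcription of the instance declaration in Listing 1.2.
INSTANCE = {
    "Type": "AWS::S3::Bucket",
    "Properties": {
        "BucketName": "ConfigStore",
        "LoggingConfiguration": {
            "DestinationBucketName": "ConfigStore",
            "LogFilePrefix": "config-bucket-logs/",
        },
    },
}

PRIMITIVES = {"String": str}  # other primitive types are omitted in this sketch

def validate(type_name, properties, base=None):
    """Return a list of error messages; an empty list means the declaration validates."""
    base = base or type_name
    schema = SPEC.get(type_name)
    if schema is None:
        return [f"unknown type {type_name}"]
    errors = []
    for name, value in properties.items():
        if name not in schema:
            errors.append(f"{type_name}: unexpected property {name}")
            continue
        expected = schema[name]["Type"]
        if expected in PRIMITIVES:
            if not isinstance(value, PRIMITIVES[expected]):
                errors.append(f"{type_name}.{name}: expected {expected}")
        elif not isinstance(value, dict):
            errors.append(f"{type_name}.{name}: expected an object")
        else:
            # Subproperty types are looked up as "<resource type>.<property type>".
            errors.extend(validate(f"{base}.{expected}", value, base))
    for name, rule in schema.items():
        if rule.get("Required", False) and name not in properties:
            errors.append(f"{type_name}: missing required property {name}")
    return errors

resource_type = INSTANCE["Type"].replace("AWS::", "", 1)
errors = validate(resource_type, INSTANCE["Properties"])
bad_errors = validate("S3::Bucket", {"BucketName": 42, "Versioning": {}})
```

The declaration of Listing 1.2 yields no errors, while the deliberately malformed second call reports one type mismatch and one undeclared property. Note that the check is purely structural: it says nothing about the self-logging problem, which requires reasoning about how resources are connected.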


    "ConfigS3Bucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName": "ConfigStore",
        "LoggingConfiguration": {
          "DestinationBucketName": "ConfigStore",
          "LogFilePrefix": "config-bucket-logs/" }}}

Listing 1.2. S3::Bucket instance declaration

Encoding. We translate IaC specifications into DL terminological knowledge,
and IaC configurations into assertional knowledge. The conceptual modeling
features needed to model the former include axioms to define domain and range of
properties, requiredness, and functionality. These give us enough expressivity to
infer qualities of nodes that are underspecified, such as those that are
referenced by a template but not declared in it (e.g., already deployed and
running elsewhere), whose configuration is unknown. To give readers an intuition
of the encoding procedure, let us look at the equation below, which contains
some of the axioms and assertions generated by the translation of the code in
Listings 1.1 and 1.2.


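$$\mathit{Spec}_{\mathrm{S3::Bucket}} = \{\ \mathrm{dom}(\mathit{bucketName}, \mathrm{BUCKET}),\ \mathrm{ran}(\mathit{bucketName}, \mathrm{String}),\ (\mathrm{Funct}\ \mathit{bucketName}),\ \ldots,\ \mathrm{dom}(\mathit{destinationBucket}, \mathrm{LOGCONFIG}),\ \mathrm{ran}(\mathit{destinationBucket}, \mathrm{BUCKET}),\ \ldots\ \}$$

$$\mathit{Config} = \{\ \mathrm{BUCKET}(\mathit{ConfigS3Bucket}),\ \mathit{bucketName}(\mathit{ConfigS3Bucket}, \text{“ConfigStore”}),\ \mathit{loggingConfig}(\mathit{ConfigS3Bucket}, x),\ \mathit{destinationBucket}(x, \mathit{ConfigS3Bucket}),\ \mathit{logFilePrefix}(x, \text{“config-bucket-logs”})\ \}$$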
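
As a rough illustration of how such assertions could be produced, the Python sketch below turns a template fragment into concept and role assertions and then looks for buckets that log to themselves. It is a simplification under stated assumptions, not the translation implemented in the tool: only S3 buckets and their logging settings are handled, the function names `encode` and `self_logging_buckets` are hypothetical, fresh individuals are named `_x1`, `_x2`, and so on, and a `DestinationBucketName` is resolved to the bucket that declares the same `BucketName`. The final check is plain pattern matching over the assertions and does not replace a DL reasoner.

```python
import itertools

def encode(template):
    """Translate bucket declarations into concept assertions and role assertions."""
    buckets = {rid: decl.get("Properties", {})
               for rid, decl in template.items() if decl["Type"] == "AWS::S3::Bucket"}
    # Map declared bucket names to logical IDs so that references can be resolved.
    by_name = {props["BucketName"]: rid for rid, props in buckets.items() if "BucketName" in props}
    concepts, roles = set(), set()
    fresh = itertools.count(1)
    for rid, props in buckets.items():
        concepts.add(("BUCKET", rid))
        if "BucketName" in props:
            roles.add(("bucketName", rid, props["BucketName"]))
        log = props.get("LoggingConfiguration")
        if log is None:
            continue
        x = f"_x{next(fresh)}"                     # individual for the nested object
        roles.add(("loggingConfig", rid, x))
        dest = log.get("DestinationBucketName")
        if dest is not None:
            # An unresolved name is kept as an undeclared (underspecified) individual.
            roles.add(("destinationBucket", x, by_name.get(dest, dest)))
        if "LogFilePrefix" in log:
            roles.add(("logFilePrefix", x, log["LogFilePrefix"]))
    return concepts, roles

def self_logging_buckets(roles):
    """Buckets b with loggingConfig(b, x) and destinationBucket(x, b) for some x."""
    cfg = {(a, b) for (r, a, b) in roles if r == "loggingConfig"}
    dst = {(a, b) for (r, a, b) in roles if r == "destinationBucket"}
    return {b for (b, x) in cfg if (x, b) in dst}

TEMPLATE = {
    "ConfigS3Bucket": {
        "Type": "AWS::S3::Bucket",
        "Properties": {
            "BucketName": "ConfigStore",
            "LoggingConfiguration": {
                "DestinationBucketName": "ConfigStore",
                "LogFilePrefix": "config-bucket-logs/",
            },
        },
    }
}

concepts, roles = encode(TEMPLATE)
offenders = self_logging_buckets(roles)
```

For the template of Listing 1.2 the sketch reports `ConfigS3Bucket`, which matches the violated best practice “no bucket should store its own logs.”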


4 Security Properties Specification 


We group properties into three categories that reflect their high-level meaning: 
security issues, mitigations, and global protections to security concerns. We view 
these in analogy to must and may specifications, which one would use to express
that an issue may be present (vs. must be absent) or that a protection must be
in place (vs. may be missing). Each property type is matched to a corresponding 
query structure, which aids the translation of security requirements into formal 
specifications and implements different fail/pass logics. Queries are written as 
description logic expressions whose outcome can be one of UNSAT, SAT with no 
instance found (SAT/0), and SAT with instances (SAT/+). These are achieved 
by running a satisfiability check, possibly followed by an instance retrieval call. 
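
The two-step evaluation can be sketched as follows. The callables `is_satisfiable` and `retrieve_instances` are hypothetical stand-ins for calls into a reasoner, and the two pass criteria shown follow the failure definitions given in Sect. 5.1 for security issues and global protections; mitigations are left out of this sketch.

```python
from enum import Enum

class Outcome(Enum):
    UNSAT = "UNSAT"
    SAT_NONE = "SAT/0"   # satisfiable, but no instance found
    SAT_SOME = "SAT/+"   # satisfiable, with at least one instance

def classify(is_satisfiable, retrieve_instances):
    """Run the satisfiability check first; retrieve instances only when it succeeds."""
    if not is_satisfiable():
        return Outcome.UNSAT, []
    instances = list(retrieve_instances())
    return (Outcome.SAT_SOME if instances else Outcome.SAT_NONE), instances

def security_issue_passes(outcome):
    """A security issue must be falsified: it fails whenever it is possible."""
    return outcome is Outcome.UNSAT

def global_protection_passes(outcome):
    """A global protection fails when no instance implementing it is found."""
    return outcome is Outcome.SAT_SOME
```

Keeping the instance retrieval behind the satisfiability check mirrors the description above: retrieval is only attempted for queries that are already known to be satisfiable.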


Mitigations are configurations of single resources that reduce the likelihood of a 
security event. In order to pass, these checks must be verified. Examples are: 


M1 “All buckets must keep logs,” 
M2 “Only buckets that host websites can have a public preset ACL,” and 
M3 “Data stores must have backup or versioning enabled.” 


Security Issues are configurations that potentially increase exposure to security 
concerns. In order to pass, these checks must be falsified. Examples are: 


I1 “There may be a bucket that is not encrypted,” 
I2 “Encrypted bucket that sends events to a not-encrypted queue,” and 
I3 “There may be a networking component that opens all ports to all.” 


Global Protections are more general mitigations, applied on single resources 
or as configuration patterns, whose presence and proper configuration ensures 
protection over the system as a whole. Examples are: 


P1 “There is an alarm configured to perform an action when triggered,” and 
P2 “There is a configuration recorder logging changes to the infrastructure.” 


We refer the reader to the repository in [14] for the properties specification files.2


5 Application to Existing Infrastructure 


We now discuss the application of our approach to real-world IaC deployments. 
We analyze AWS CloudFormation specification and configuration files, showing 
that the approach is practical, scalable, and identifies potential security issues. 


Operation of the Tool. We develop a tool that performs three main tasks. First, 
the encoding of the cfn resource specifications into formal models (Resource 
Terminologies).3 Second, the encoding of the actual cfn configuration files, also
called StackSet, into formal models (Infrastructure Model). Third, inference and 
query answering for a set of predefined queries. We use the OWLApi [22] for the 
encoding phase, and JFact [39] as the inference engine. 


2 https://tiny.cc/PropertiesSpecifications.
3 Available here: https://tiny.cc/ResourceTerminologies.

Table 1. Evaluation results (mean times in milliseconds).

ID     N   N_RT       ENC     N_A        INF   UNSAT    SAT/0   SAT/+
05     6      6     44.53     814      30.64    0.67        -    2.46
11     8            79.22     917      37.09    0.72        -    2.86
03    10      7     59.94     886      35.65    0.64     2.23    1.56
09    10      9     76.33     940      38.66    0.68     5.03    2.96
02    11      8     76.73    1194      49.99    0.85     2.66    2.02
01    16      7     94.95    1007      43.38    0.66     3.96    1.83
08    19      8     87.66    1051      50.93    0.78     5.40    3.23
10    30      9     89.07    1177      71.23    0.86     2.62    2.08
06    30     12    102.00    1666     108.30    1.05        -    4.91
12    31     21    185.06    2798     301.61    4.99    24.93   36.43
13    51     32    241.17    3835     608.09    7.16    38.56   47.93
14    73     31    264.56    4143     847.36    2.83    51.36   19.20
15    79     21    313.40    4596     901.18    2.86        -   17.55
04   132     33    363.58    4834    2100.85    2.94   162.95   23.21
07   508     21   1005.46   10161   15834.14    7.34    40.86   13.52


Experimental Setup. We run our tool on 15 CloudFormation StackSets openly 
available on GitHub. Regarding metrics, we define the infrastructure size as 
the numbers of both declared resources (N) and their types (N_RT). The latter
determines which resource terminologies are imported into the final encoded
model and thus influences its size, measured in number of logical axioms (N_A).
The smallest StackSet has 6 resources and 6 resource types, the largest has 508 
resources and 21 resource types. We implement 50 properties from the ScoutSuite 
collection [35] that are applicable at design time and, thus, over IaC deployment 
files. Of the 50 properties, 29 are mitigations, 18 are security issues, and 3 are 
global protections. We conduct our evaluation on an Intel Core i5 with 16GB 
RAM and perform warmup runs and clear the heap before each measurement. 
This tuning helps to minimize the impact of just-in-time compilation and to 
reduce the likelihood of garbage collection during the measured benchmark runs. 


Results Evaluation. The average compilation time of the entire cfn resource 
specifications (542 files) was 940 ms. Table 1 reports the results of our experi- 
mental evaluation. StackSets are sorted by number of resources. For each, we 
measure the time taken by the stackset encoding (ENC), inference (INF), and 
query answering task (grouped by outcome: UNSAT, SAT with no instances, and 
SAT with instances). As we can see from the table, the encoding time increases 
with the infrastructure’s size, producing larger models that require longer infer- 
ence times. Average query answering times increase accordingly. UNSAT queries 
have shorter average answering times than those evaluating to SAT/0 or SAT/+ 
(UNSAT proofs are found before a SAT outcome can be deduced). In addition,
once a query is proved SAT, we invoke an instance-retrieval procedure to
determine whether satisfying instances are present or not. The specific infrastruc- 
ture configuration and its size are the main influencing factors of query answer- 
ing times. Considering that the average template has about 50-100 resources, 
and templates having 100-500 resources are rare, the results suggest that our 
approach scales to real-world IaC templates. For example, StackSet 04 has 132
resources, is encoded in 363 ms, classified in 2.1 s, and has a maximum average
per-query time of 162 ms. Assuming a pool of 100 checks to be run, the automated
modeling and verification of such an infrastructure would take, in the worst case,
around 18 s.
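
The worst-case estimate for StackSet 04 can be reproduced from the Table 1 entries. The short Python calculation below is only a sanity check of that arithmetic; it assumes that every one of the 100 checks costs as much as the slowest mean query time reported for this StackSet.

```python
# Values for StackSet 04, taken from Table 1 (milliseconds).
encoding_ms = 363.58
inference_ms = 2100.85
worst_query_ms = 162.95   # largest mean per-query time for this StackSet (SAT/0)
checks = 100              # size of the pool of checks assumed in the text

total_ms = encoding_ms + inference_ms + checks * worst_query_ms
total_s = total_ms / 1000
print(f"estimated worst case: {total_s:.1f} s")
```

With the unrounded table values the sum is about 18.8 s (about 18.7 s with the rounded figures quoted in the text), so the estimate is dominated by query answering rather than by encoding or classification.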


5.1 Found Security Issues 


Across all 15 deployments, we run 15 x 50 = 750 checks: 608 pass and 142 fail. Of 
the 142 failing checks, 73 do not return any instance and 69 return one or more 
instances (i.e., they fail with a SAT/+ outcome). Such a difference is due to the 
nature of each check and its definition of failure. A global protection check
fails when no instance implementing the protection is found; a security issue
check fails whenever the issue is possible (SAT/0 or SAT/+); and a mitigation
check fails when no instance is found. We consider SAT/+ findings particularly
important, as they witness not only a potential security issue but also an actual
misconfiguration. In particular, the 69 SAT/+-failing checks fail on 239 resource
instances, with the most frequent issues being:


Missing or misconfigured encryption 131 
Missing or misconfigured logging 46 
Missing or misconfigured versioning/backup/replication 44 
Missing User password reset requirement 12 
Misconfigured authorization 3 


Misconfigured networking configuration 


The 73 findings returning no instances fall into two groups: the absence of any 
monitoring or alarming system is very frequent, as is the dependency on external 
resources whose security posture cannot be assessed. 


Absent global monitoring/alarming/logging protection 41 


Usage of external resources with unknown configuration 32 


6 Semantic Reasoning About Dataflows 


"CustomerData": {                        "TestData": {
  "Type": "AWS::S3::Bucket",               "Type": "AWS::S3::Bucket",
  "Properties": {                          "Properties": {
    "LoggingConfig": {                       "LoggingConfig": {
      "DestinationBucket":                     "DestinationBucket":
        "AccessLog" }}},                         "AccessLog" }}}
"TopicSubscription": {                   "AccessLog": {
  "Type": "AWS::SNS::Subscription",        "Type": "AWS::S3::Bucket",
  "Properties": {                          "Properties": {
    "Endpoint": "devs@mail",                 "NotificationConfig": {
    "Protocol": "email",                       "TopicConfig": {
    "TopicArn": "AccessTopic" }}                 "Topic": "AccessTopic" }}}},
                                         "AccessTopic": {
                                           "Type": "AWS::SNS::Topic" ... }

Fig. 1. Sample template: accounts prod (left) and test (right).


To conclude our study, we manually craft two proof-of-concept models of terms
related to cloud security (ontologies). We use these to extend the formalization
of the CloudFormation IaC specification that was automatically generated by
our tool. Such domain-specific ontologies formalize several common cloud terms,
such as account, deployment, authenticated and unauthenticated users; generic
dataflow terms, such as storage, process, nodes, and flows of different kinds; and
service-specific dataflow terms. By adding these on top of the underlying IaC 
formal specification, we can reason about the higher-level business logic and 
reachability of the infrastructure, and we can abstract it and visualize it in a 
more convenient way. This is where the full inference power of description logics 
comes into play. Such an inference power would be hard to achieve with an alter- 
native encoding (e.g., using a modal logic). Let us illustrate how this technique 
is applied to system-level analyses of interest for a security review: dataflow 
and trust boundary analyses. A trust boundary is a portion of a system whose 
components trust each other and where data can securely flow. Multiple trust 
boundaries may exist within one system. Dataflows that travel across bound- 
aries may introduce security issues and should be carefully reviewed. In Fig. 1, 
we see an example of such a situation, where the infrastructure is deployed across 
two accounts, prod and test, sharing resources AccessLog and AccessTopic. In 
our encoding, we use DL complex role inclusion axioms to rewrite properties
that (when chained) imply the existence of a more general relation and to infer
additional characteristics of nodes. For example, in the following list, axioms
2-7 formalize the relationships of “logging to” and “sending notifications to” a
resource, which imply the existence of a transitive dataflow between nodes; and
axioms 8-9 allow us to infer that the node devs@mail is an external node.


$\mathit{LoggingConfig} \circ \mathit{DestinationBucket} \sqsubseteq \mathit{logsTo}$  (2)
$\mathit{TopicArn}^{-} \circ \mathit{Endpoint} \sqsubseteq \mathit{sendsNotifications}$  (3)
$\mathit{logsTo} \sqsubseteq \mathit{dataflow}$  (4)
$\mathit{NotificationConfig} \circ \mathit{TopicConfig} \circ \mathit{Topic} \sqsubseteq \mathit{sendsNotifications}$  (5)
$\mathit{sendsNotifications} \sqsubseteq \mathit{dataflow}$  (6)
$\mathit{dataflow} \circ \mathit{dataflow} \sqsubseteq \mathit{dataflow}$  (7)
$\exists \mathit{Protocol}.\{\text{“email”}\} \sqsubseteq \forall \mathit{Endpoint}.\mathit{EmailAddress}$  (8)
$\mathit{EmailAddress} \sqsubseteq \mathit{ExternalNode}$  (9)


Fig. 2. Dataflow extracted from Fig. 1


This encoding enables us to compute a succinct dataflow diagram from
the reasoned IaC configuration (see Fig. 2), and to formally verify properties
that usually require a manual analysis of the infrastructure and its underlying
graph representation. E.g., the question, “can data flow from the customer-data
bucket to the outside?” can now be formalized as a DL formula and, using a
reasoning engine, the existence of a dataflow that starts on the customer-data
bucket and reaches the devs@mail node can now be inferred. We note that, due to
the structure of the TopicSubscription resource, this dataflow could not have
been detected with simple reachability analysis on a graph built without the aid
of semantic reasoning. Moreover, the dataflow diagram highlights another
potential source of information leakage: testers being exposed to customer access
information. This needs to be mitigated by enforcing the proper trust boundaries,
in particular, by adding a dedicated access log storage for customer-data bucket
in the prod account.
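
The inference described above can be imitated with a small forward-chaining sketch in Python. It is an approximation under stated assumptions, not a DL reasoner: the facts are a hand transcription of Fig. 1 in which nested objects become hypothetical individuals (`_lc1`, `_lc2`, `_nc`, `_tc`), a role name ending in `-` denotes the inverse role, the role inclusions (axioms 2-7 in the text) are applied as closed-world rules until nothing new is derived, and the helper names are made up for this example.

```python
FACTS = {
    ("LoggingConfig", "CustomerData", "_lc1"), ("DestinationBucket", "_lc1", "AccessLog"),
    ("LoggingConfig", "TestData", "_lc2"), ("DestinationBucket", "_lc2", "AccessLog"),
    ("NotificationConfig", "AccessLog", "_nc"), ("TopicConfig", "_nc", "_tc"),
    ("Topic", "_tc", "AccessTopic"),
    ("TopicArn", "TopicSubscription", "AccessTopic"),
    ("Endpoint", "TopicSubscription", "devs@mail"),
    ("Protocol", "TopicSubscription", "email"),
}

# Each rule reads: the chain of roles on the left implies the role on the right.
RULES = [
    (("LoggingConfig", "DestinationBucket"), "logsTo"),
    (("TopicArn-", "Endpoint"), "sendsNotifications"),
    (("NotificationConfig", "TopicConfig", "Topic"), "sendsNotifications"),
    (("logsTo",), "dataflow"),
    (("sendsNotifications",), "dataflow"),
    (("dataflow", "dataflow"), "dataflow"),          # transitivity
]

def pairs(role, facts):
    """Pairs related by a role; a trailing '-' selects the inverse role."""
    if role.endswith("-"):
        return {(b, a) for (r, a, b) in facts if r == role[:-1]}
    return {(a, b) for (r, a, b) in facts if r == role}

def compose(chain, facts):
    """Relational composition of the roles in a chain."""
    rel = pairs(chain[0], facts)
    for role in chain[1:]:
        nxt = pairs(role, facts)
        rel = {(a, c) for (a, b) in rel for (b2, c) in nxt if b == b2}
    return rel

def saturate(facts, rules):
    """Apply all rules until no new role assertion can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for chain, head in rules:
            for a, b in compose(chain, facts):
                if (head, a, b) not in facts:
                    facts.add((head, a, b))
                    changed = True
    return facts

def external_nodes(facts):
    """Axioms 8-9: endpoints of resources whose Protocol is "email" are external."""
    email = {a for (r, a, b) in facts if r == "Protocol" and b == "email"}
    return {b for (r, a, b) in facts if r == "Endpoint" and a in email}

def reachable(start, facts):
    """Plain reachability that follows every raw edge in its declared direction."""
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for _, a, b in facts:
            if a == node and b not in seen:
                seen.add(b)
                todo.append(b)
    return seen

closed = saturate(FACTS, RULES)
external = external_nodes(closed)
leaks = {a for (r, a, b) in closed if r == "dataflow" and b in external}
```

After saturation, `leaks` contains `CustomerData`, so a dataflow from the customer-data bucket to the external node `devs@mail` is derived. Plain reachability from `CustomerData` over the raw edges stops at `AccessTopic`, because the subscription points to the topic rather than the other way around; the inverse role in the second rule is what bridges that gap.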


7 Related Work 


To the best of our knowledge, the problem of formally verifying the design of a 
cloud infrastructure in its entirety has not been addressed before. Formal reason- 
ing techniques have been successfully applied to different aspects of the cloud, 
e.g. networks and access policies [4,5,7,16]. Non-formal tools exist that recom- 
mend and run checks against already deployed resources [13,35], or scan IaC tem- 
plates [10,11,38] for syntactical patterns violating security best practices. These 
checks overlap considerably and can be expressed in our framework as well. 
The disadvantages of such tools are that checks are local to single components, 
can be performed only post-deployment, need complex configurations, access 
permissions, or even manual interaction. The CFn-Linter [10] has a rule-based 
component that users can extend with custom syntax checks, but none of the 
rules currently available focus on security. The CFn-nag linting tool [11] checks
compliance with best practices only locally to single resources; e.g., it cannot
detect issues such as “there is an events queue, receiving from a bucket with
critical functionality, that may not be encrypted” or “there might be a user that
is shared by multiple policies” (which would go against the least privilege
principle), nor can it include in its analysis external resources that are
referenced by the template being linted.

Regarding our choice of logic, large-scale configuration problems have been 
tackled with description logic before [26,27]. Simpler first-order logic formulas 
with operators to represent object-oriented interface relationships could be used 
to model IaC specifications. However, such an encoding would only partially
solve our problem, which is more complex because our overall goal is to do
formal semantic analyses (e.g., dataflow and threat modeling). Semantic-based 
approaches, even DL-based, are being used to do conceptual modeling of security 
engineers’ expertise with the provable and explainable inference capabilities of 
logics. As an example, we refer the reader to the OWASP “Ontology-driven 
Threat Modeling” project [31] that aims at the formalization of security-related 
knowledge in the context of different types of computer systems by means of 
description logic ontologies. In contrast to logic programming languages, such 
as Datalog, DLs inherently support functionality axioms and the existence of 
anonymous individuals within a domain that is assumed to be open. These are 
supported out-of-the-box without the need for an additional, more complex, 
axiomatization or encoding. In particular, we took advantage of DL’s open- 
world assumption to implement, in our properties encoding, verification and 
falsification. Another alternative to DLs as a modeling language would be to use 
3-valued models with labels on states and transitions and apply model checking 
[8,9]. However, expressive branching-time logics [25,33] have not been studied 
in the context of 3-valued models and we are also not aware of tool support at 
the level available for DLs (cf. [17,21]). 


8 Conclusion and Future Work 


Throughout this case study, we investigated the usage of description-logic-based
semantic reasoning to evaluate the security of cloud infrastructure
pre-deployment. We encoded Amazon Web Services’ Infrastructure as Code
specifications and configurations into description logic models and verified the presence
and absence of potential security issues. We showed how this approach enables 
deeper system-level analyses such as dataflow analysis. All results can be gen- 
eralized to other existing IaC tools. While working on this project, we inter- 
acted with developers on two occasions. First, for the benchmark templates used 
in our experimental evaluation, we contacted the owners, told them about the 
misconfigurations, and discussed potential security implications. Second, within 
AWS, security engineers use a technique based on this paper for security reviews 
of AWS products before they are launched, helping developers fix real issues 
pre-deployment. In the process, we received valuable feedback that we used for 
improving precision and reducing the number of false-positive results. We plan 
to continue searching for an even better-fitting description logic formalism,
query language, three-valued semantics, and decision procedures for verification 
and falsification of properties relevant to security analyses, such as dataflows, 
trust boundaries, and threat modeling.


Abstract. We present a method for synthesizing recursive functions
that satisfy both a functional specification and an asymptotic resource 
bound. Prior methods for synthesis with a resource metric require 
the user to specify a concrete expression exactly describing resource 
usage, whereas our method uses big-O notation to specify the asymp- 
totic resource usage. Our method can synthesize programs with complex 
resource bounds, such as a sort function that has complexity O(n log(n)). 
Our synthesis procedure uses a type system that is able to assign an 
asymptotic complexity to terms, and can track recurrence relations of 
functions. These typing rules are justified by theorems used in analysis 
of algorithms, such as the Master Theorem and the Akra-Bazzi method. 
We implemented our method as an extension of prior type-based synthe- 
sis work. Our tool, SYNPLEXITY, was able to synthesize complex divide- 
and-conquer programs that cannot be synthesized by prior solvers. 


1 Introduction 


Program synthesis is the task of automatically finding programs that meet a 
given behavioral specification, such as input-output examples or complete for- 
mal specifications. Most of the work on program synthesis has been devoted to 
qualitative synthesis, i.e., finding some correct solution. However, programmers 
often want more than just a correct solution—they may want the program that 
is smallest, most likely, or most efficient. While there are some techniques for 
adding a quantitative syntactic objective in program synthesis [12]—e.g., finding 
a smallest solution, or a most likely solution with respect to some distribution— 
little attention has been devoted to quantitative semantic objectives—e.g., syn- 
thesizing a program that has a certain asymptotic complexity. 

Recently, Knoth et al. [16] studied the problem of resource-guided program 
synthesis, where the goal is to synthesize programs with limited resource usage. 
Their approach, which combines refinement-type-directed synthesis [18] and 
automatic amortized resource analysis (AARA) [9], is restricted to concrete 
resource bounds, where the user must specify the exact resource usage of the 
synthesized program as a linear expression. This limitation has two drawbacks: 
(i) the user must have insights about the coefficients to put in the supplied
bound, which means that the user has to provide details about the complexity
of code that does not yet exist; (ii) the limitation to linear bounds means
that the user cannot specify resource bounds that involve logarithms, such as
O(log n) and O(n log n), common in problems based on divide and conquer.

© The Author(s) 2021

In this paper, we introduce SYNPLEXITY, a type-system paired with a type- 
directed synthesis technique that addresses these issues. In SYNPLEXITY, the 
user provides as input a refinement type that describes both the functionality 
and the asymptotic (big-O) resource usage of a program. For example, a user 
might ask SYNPLEXITY to synthesize an implementation of a sorting function 
with resource usage O(n log n), where n is the length of the input list. As in
prior work, SYNPLEXITY also takes as input a set of auxiliary functions that the 
synthesized program can use. SYNPLEXITY then uses a type-directed synthesis 
algorithm to search for a program that has the desired functionality, and satisfies 
the asymptotic resource bound. SYNPLEXITY’s synthesis algorithm uses a new 
type system that can reason about the asymptotic complexity of functions. To 
achieve this goal, this type system uses two ideas. 


1. The type system uses recurrence relations instead of concrete resource
potentials [9] to reason about the asymptotic complexity of functions. For example,
the recurrence relation $T(u) \le 2T(\lfloor u/2 \rfloor) + O(u)$ denotes that on an input of
size $u$, the function will perform at most two recursive calls on inputs of size
at most $\lfloor u/2 \rfloor$, and will use at most $O(u)$ resources outside of the recursive
calls.1 For a given recurrence relation, our type system uses refinement types
to guarantee that a function typed with this recurrence relation performs the
correct number of recursive calls on parameters of the appropriate sizes.

2. These typing rules are justified by classic theorems from the field of analysis 
of algorithms, such as the Master Theorem [5], the Akra-Bazzi method [1], 
or C-finite-sequence analysis [13]. 
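
To illustrate how the shape of a recurrence determines an asymptotic bound, the Python sketch below applies the common simplified form of the Master Theorem to recurrences $T(n) \le a\,T(n/b) + O(n^d)$. It is an illustration of the textbook theorem only: the function name `master_bound` and its return convention are assumptions of this example, and it says nothing about how the type system itself instantiates the theorem.

```python
import math

def master_bound(a, b, d):
    """Bound for T(n) <= a*T(n/b) + O(n^d), assuming a >= 1, b > 1, d >= 0.

    Returns (exponent, log_factor): the bound is O(n^exponent * log n) when
    log_factor is True and O(n^exponent) otherwise.
    """
    if a < 1 or b <= 1 or d < 0:
        raise ValueError("expected a >= 1, b > 1 and d >= 0")
    critical = math.log(a, b)                 # the critical exponent log_b(a)
    if math.isclose(d, critical, abs_tol=1e-9):
        return d, True                        # work is balanced across levels
    if d > critical:
        return d, False                       # the root dominates
    return critical, False                    # the leaves dominate

merge_sort_like = master_bound(2, 2, 1)       # T(u) <= 2T(u/2) + O(u)
halving = master_bound(1, 2, 0)               # T(u) <= T(u/2) + O(1)
```

The first call corresponds to the recurrence used as an example above and yields O(n log n); the second corresponds to a function that halves its argument with constant extra work and yields O(log n).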


Guéneau et al. observed that reasoning with O-notation can be tricky, and 
exhibited a collection of plausible-sounding, but flawed, inductive proofs [8, §2]. 
We avoid this pitfall via SYNPLEXITY’s type system, which establishes whether 
a term satisfies a given recurrence relation. SYNPLEXITY uses theorems that 
connect the form of a recurrence relation—e.g., the number of recursive calls, 
and the argument sizes in the subproblems—to its asymptotic complexity. In 
particular, the SYNPLEXITY type system does not encode inductive proofs of 
the kind that Guéneau et al. show can go astray. 

SYNPLEXITY can synthesize functions with complexities that cannot be han- 
dled by existing type-directed tools [16,18], and compares favorably with existing 
tools on their benchmarks. Furthermore, for some domains, SYNPLEXITY’s type 
system allows us to discover auxiliary functions automatically (e.g., the split 
function of a merge sort), instead of requiring the user to provide them. 


1 The recurrence relation above is one possible instantiation of the Master Theorem [5, 
§4.5 and §4.6]; it can also be instantiated as $T(u) \le 2T(\lceil u/2 \rceil) + O(u)$. The type system
makes use of certain templates for instantiating the algorithm-analysis theorems that 
we use. The use of templates means that the type system does not use all possible 
instantiations, but all instantiations used in the type system are valid ones.

Contributions. The contributions of our work are as follows:


— A type system that uses refinement types to check whether a program satisfies 
a recurrence relation over a specified resource (Sect. 3). 

— A type-directed algorithm that uses our type system to synthesize functions 
with given resource bounds (Sect. 4, Sect. 5). 

— SYNPLEXITY, an implementation of our algorithm that, unlike prior tools, 
can synthesize programs with desired asymptotic complexities (Sect. 6). 


Complete proofs and details of the type system can be found in the technical 
report [11]. 


2 Overview 


In this section, we illustrate the main components of our algorithm through an 
example. Consider the problem of synthesizing a function prod that implements 
the multiplication of two natural numbers, x and y. We want an efficient solution 
whose time complexity is O(log x) with respect to the value of the first argument 
x. In Subsect. 2.1, we show how existing type-directed synthesizers solve this 
problem in the absence of a complexity-bound constraint. In Subsect. 2.2, we 
illustrate how to specify asymptotic bounds in type-directed synthesis problems. 
In Subsect. 2.3, we show how the tracking of recurrence relations can be used to 
establish complexity bounds as well as guide the synthesis search. 


2.1 Type-Directed Synthesis 


We first review one of the state-of-the-art type-directed synthesizers, SYNQUID, 
through the aforementioned example—i.e., synthesizing a program prod that 
computes the product of two natural numbers. In SYNQUID, the specification is 
given as a refinement type that describes the desired behavior of the synthesized 
function. We specify the behavior of prod using the following refinement-type: 


prod :: x:{Int | ν ≥ 0} → y:{Int | ν ≥ 0} → {Int | ν = x * y}.


Here the types of the inputs x and y, as well as the return type of prod, are
refined with predicates. The refinement {Int | ν ≥ 0} declares x and y to be
non-negative, and the refinement {Int | ν = x * y} of the return type declares the
output value to be an integer that is equal to the product of the inputs x and y.
In addition to the specification, the synthesizer receives as input some signatures 
of auxiliary functions it can use. The specifications of auxiliary functions are also 
given as refinement types. In our example, we have the following functions: 


even   :: x:Int → {Bool | x mod 2 = 0}      dec  :: x:Int → {Int | ν = x - 1}
double :: x:Int → {Int | ν = x + x}         div2 :: x:Int → {Int | ν = ⌊x/2⌋}
plus   :: x:Int → y:Int → {Int | ν = x + y}


With the above specification and auxiliary functions, SYNQUID will output
the implementation of prod shown in Eq. (1).


prod = λx.λy. if x == 0 then x else plus y (prod (dec x) y)        (1)


SYNQUID uses a sophisticated type system to guarantee that the synthesized 
term has the desired type. Furthermore, SYNQUID uses its type system to prune 
the search space by only enumerating terms that can possibly be typed, and 
thus meet the specification. Terms are enumerated in a top-down fashion, and 
appropriate specifications are propagated to sub-terms. As an example, let us see 
how SYNQUID synthesizes the function body (an if-then-else term) in Eq.
(1), which is of refinement type {Int | ν = x * y}. SYNQUID will first enumerate an
integer term for the then branch: a variable term x. Then, with the then branch
fixed, the condition guard must be refined by some predicate φ under which the
then branch (the term x refined by ν = x) fulfills the goal type {Int | ν = x * y},
i.e., $\forall x, y \ge 0.\ \varphi \wedge \nu = x \implies \nu = x * y$. With this constraint, SYNQUID identifies
the term x == 0 as the condition. Finally, SYNQUID propagates the negation of
the condition to the else branch (the else branch should be a term of type
{Int | ν = x * y} with the path condition x ≠ 0) and enumerates the term plus
y (prod (dec x) y) as the else branch, which has the desired type.

The program in Eq. (1) is correct, but inefficient. Let us count each call to an
auxiliary function as one step, and let T(x) denote the number of steps in which
the program runs with input x. The implementation in Eq. (1) runs in O(x) steps
because T(x) satisfies the recurrence $T(x) = T(x - 1) + 2$, implying $T(x) \in O(x)$.
Because SYNQUID does not provide a way to specify resource bounds such as
O(log x), one cannot ask SYNQUID to find a more efficient implementation.
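
To make the step count concrete, the following Python sketch mirrors Eq. (1) and counts one step per auxiliary call. The name prod_linear and the convention of returning a (value, steps) pair are illustrative assumptions; they are not part of SYNQUID or SYNPLEXITY.

```python
def prod_linear(x, y):
    """Mirror of Eq. (1); returns (product, number of auxiliary calls)."""
    if x == 0:
        return x, 0
    # "dec x" is one auxiliary call, followed by the recursive call.
    value, steps = prod_linear(x - 1, y)
    # "plus y (...)" is the second auxiliary call on this level.
    return y + value, steps + 2


# Each non-zero level contributes exactly two steps, so T(x) = 2x.
for x in range(6):
    print(x, prod_linear(x, 7))
```

The counter grows linearly in x, which matches the recurrence T(x) = T(x - 1) + 2 discussed above.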


2.2 Adding Resource Bounds 


In our tool, SYNPLEXITY, one can specify a synthesis problem with an asymp- 
totic resource bound, and can ask SYNPLEXITY to find an O(log x) implementa- 
tion of prod. To express this intent, the user needs to specify (1) the asymptotic 
resource-usage bound the synthesized program should satisfy, (2) the cost of 
each provided auxiliary function, and (3) the size of the input to the program. 


Asymptotic Resource Bound. We extend refinement types with resource annotations.
The annotated refinement types are of the form (τ; α), where τ is a regular
refinement type and α is a resource annotation. The following example asks the
synthesizer to find a solution with the resource-usage bound O(log u):


prod :: (x:{Int | ν ≥ 0} → y:{Int | ν ≥ 0} → {Int | ν = x * y}, O(log u))


Cost of Auxiliary Functions. The auxiliary functions supplied by the user serve 
as the API in terms of which the synthesized program is programmed. Thus, the 
resource usage of the synthesized program is the sum of the costs of all auxiliary 
calls made during execution. We allow users to assign a polynomial cost $O(u^a)$,
for some constant a, or a constant cost O(1) to each auxiliary function. Here, u
is a free variable that represents the size of the problem on which the auxiliary
function is called.

In the prod example, all auxiliary functions are assigned constant cost, e.g.,
we give even the signature even :: (x:Int → {Bool | x mod 2 = 0}, O(1)).


Size of Problems. The user needs to specify a size function, size : τ → Int, that
maps inputs to their sizes, e.g., when synthesizing the sorting function for an
input of type list, the size function can be λl.|l|, the length of the input list.
In the prod example, the size function is size = λx.λy.x.


2.3 Checking Recurrence Relations 


We extend SYNQUID’s refinement-type system with resource annotations, so that 
the extended type system enforces the resource usage of terms. The idea of the 
type system is to check if the given function satisfies some recurrence relation. If 
so, it can infer that the function also satisfies the corresponding resource bound. 
For example, according to the Master Theorem [3], if a function f satisfies the
recurrence relation $T(u) \le T(\lfloor u/2 \rfloor) + O(1)$, where u is the size of the input, then
the resource usage of f is bounded by O(log u). Checking if a function satisfies a
given recurrence relation can be performed by checking if the function contains
appropriate recursive calls; e.g., if a function contains one recursive call to a
sub-problem of half size, and consumes only a constant amount of resources in
its body, then it satisfies $T(u) \le T(\lfloor u/2 \rfloor) + O(1)$.

The following rule is an example of how we connect recurrence annotations 
and resource bounds. 

$$\frac{\Gamma;\ f : \tau_1 \to \tau_2 \vdash t :: (\tau_2;\ ([1, \lfloor u/2 \rfloor], O(1)))}{\Gamma \vdash (\mathsf{fix}\ f.\ \lambda x.\ t) :: (\tau_1 \to \tau_2;\ O(\log u))}$$


The rule instantiates the Master Theorem example above. Note that the annotation
([1, ⌊u/2⌋], O(1)) states that the function body contains up to one recursive
call to a problem of size ⌊u/2⌋, and the resource usage in the body of t (aside from
calls to f itself) is bounded by O(1). The rule states that if the function body
t of type τ2 contains one recursive call to a sub-problem of size ⌊u/2⌋, then the
function will be bounded by O(log u).


The implementation of prod shown in Eq. (2) runs in O(log x) steps.


prod = λx.λy. if x == 0 then x else
                if even x then double (prod (div2 x) y)
                else plus y (double (prod (div2 x) y))        (2)


To check that, SYNPLEXITY’s type system counts the number of recursive calls 
along any path of the function. There are three paths (two nested if-then-else 
terms) in the program, and at most one recursive call along each path. Also, 
one can check that the problem size of each recursive call is no more than ⌊u/2⌋.


Term              t ::= e | b
E-term            e ::= x | c | true | false | x e1 ... en
Branching term    b ::= if e then t else t                              (I-term)
                      | match e with |_i C_i (x_i^1 ... x_i^n) → t_i
Function term     F ::= fix f. λx1 ... λxn. t                           (I-term)


Fig. 1. SYNPLEXITY syntax.


Logical expr.     ψ, φ ::= c | x | m(ψ) | ⊤ | ⊥ | ψ mod ψ | ψ ∧ ψ | ψ ∨ ψ | ψ = ψ | ψ + ψ | ...
Ordinary type     B ::= Bool | Int | D
Refinement type   τ ::= {B | φ} | x1:τ1 → ... → xn:τn → y:τ
Annotated type    γ ::= (τ, α)
Recurrence ann.   α ::= ([c1, ϕ1]_f, ..., [cn, ϕn]_f; O(ψ))
Environment       Γ ::= · | x:γ; Γ | φ; Γ | recFun := x; Γ | args := x1 ... xn; Γ


Fig. 2. SYNPLEXITY types.


For example, the recursive call prod (div2 x) y calls to a problem with size
div2 x, which is consistent with [1, ⌊u/2⌋], and u is x because size x y = x. In
addition, the condition that the resource usage of the body is bounded by O(1) 
is satisfied because only auxiliary functions with constant cost are called. 
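
The logarithmic behavior of Eq. (2) can be observed with a small Python sketch that again counts one step per auxiliary call (even, div2, double, plus). The name prod_log and the returned (value, steps) pair are assumptions made for illustration only.

```python
def prod_log(x, y):
    """Mirror of Eq. (2); returns (product, number of auxiliary calls)."""
    if x == 0:
        return x, 0
    # One recursive call on a problem of size floor(x / 2).
    half, steps = prod_log(x // 2, y)
    steps += 3  # even x, div2 x, double (...)
    if x % 2 == 0:
        return 2 * half, steps
    return y + 2 * half, steps + 1  # plus y (...) on the odd path


# At most four steps per level and one level per bit of x.
for x in (1, 5, 64, 1000):
    value, steps = prod_log(x, 3)
    print(x, value, steps, 4 * x.bit_length())
```

Every path through the body makes at most one recursive call and a constant number of auxiliary calls, so the step count is bounded by four times the number of bits of x.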


3 The SYNPLEXITY Type System 


In this section, we present our type system. First, we give the surface lan- 
guage and the types, which extend the SYNQUID liquid-types framework with 
resource annotations (Subsect. 3.1). Then, we show the semantics of our language 
(Subsect. 3.2). Finally, we present SYNPLEXITY’s type system (Subsect. 3.3), 
which our synthesis algorithm uses to synthesize programs with desired resource 
bounds. 


3.1 Syntax and Types 


Syntax. Consider the language shown in Fig. 1. In the language, we distinguish 
between two kinds of terms: elimination terms (E-terms) and introduction terms 
(I-terms). E-terms consist of variable terms, constant values c, and application 
terms. Condition guards and match scrutinees can only be E-terms. I-terms are
branching terms and function terms. The key property of I-terms is that if the 
type of any I-term is known, the types of its sub-terms are also known (which is 
not the case for E-terms). 


Types. Our language of types, presented in Fig. 2, extends the one of SYN- 
QUID [18] with recurrence annotations, which are used to track recurrence rela- 
tions on functions. To simplify the presentation, we ignore some of the features of 
the type system of SYNQUID [18] that do not affect our algorithm. In particular,
we do not discuss polymorphic types and the enumerating strategy that ensures
that only terminating programs are synthesized. However, our implementation 
is built on top of SYNQUID, and supports both of those features. 

Logical expressions are built from variables, constants, arithmetic operators, 
and other user-defined logical functions. Logical expressions in our type system
can be used as refinements φ, size expressions ϕ, or bound expressions ψ. Refinements
φ are logical predicates used to refine ordinary types in refinement types
{B | φ}. We usually use a reserved symbol ν as the free variable in φ, and let
ν represent the inhabitants, i.e., inhabitants of the type {B | φ} are valuations
of ν that satisfy φ. For example, the type {Int | ν mod 2 = 0} represents the
even integers. Size expressions and bound expressions are used in recurrence 
annotations, and are explained later. 

Ordinary types include primitive types and user-defined algebraic datatypes
D. Datatype constructors C are functions of type τ1 → ... → τn → D. For example,
the datatype List(Int) has two constructors: Cons : Int → List(Int) →
List(Int), and Nil : List(Int). Refinement types are ordinary types refined
with some predicates φ, or arrow types. Note that, unlike SYNQUID’s type system,
SYNPLEXITY’s type system does not support higher-order functions²; i.e.,
arguments of functions have to be non-arrow types. All occurrences of τ_i and τ
in arrow types x1:τ1 → ... → xn:τn → y:τ have to be ordinary types or refined
ordinary types. We will discuss this limitation in Sect. 7.

We use recFun to denote the name of the function for which we are perform- 
ing type-checking, and args to denote the tuple of arguments to recFun. For 
example, in the function prod shown in Eq. (1), recFun=prod and args=x y. 
An environment Γ is a sequence of variable bindings x : γ, path conditions φ,
and assignments for variables recFun and args.


Recurrence Annotations. Annotated types are refinement types annotated
with recurrence annotations. A recurrence annotation is a pair
([c1, ϕ1]_f, ..., [cn, ϕn]_f; O(ψ)) consisting of (1) a set of recursive-call costs of
the form [c_i, ϕ_i]_f, and (2) a resource-usage bound of the form O(ψ). Intuitively,
a recurrence annotation tracks the number c_i of recursive calls to f
of size ϕ_i in the first element [c1, ϕ1]_f, ..., [cn, ϕn]_f of the pair, as well as the
asymptotic resource usage of the body of the function (the second element
O(ψ)). Using these quantities, we can compute a recurrence relation describing
the resource usage of the function recFun. For example, the recurrence
annotation ([1, u − 1]_f, [1, u − 2]_f; O(1)) corresponds to the recurrence relation
$T(u) \le T(u - 1) + T(u - 2) + O(1)$.

A recursive-call cost [c, ϕ]_f associated with a function f denotes that the body
of f can contain up to c recursive calls to subproblems that have sizes up to the
one specified by size expression ϕ. A size expression, ϕ, is a polynomial over a
reserved variable symbol u that represents the size of the top-level problem. In
our paper, a problem with respect to a function g :: x1:τ1 → ... → xn:τn → y:τ
is a tuple of terms e1 ... en to which g can be applied, i.e., e_i has type τ_i for all

2 However, the type system can be extended to support restricted higher-order
functions (Sect. 5).

i from 1 to n. For the problems of function g, the size of each problem is defined
by a size function size_g, a user-defined logical function that has type τ1 →
... → τn → Int; i.e., it takes a problem of g as input and outputs a non-negative
integer. In the body of g, we say that a recursive-call term g e1 ... en satisfies a
size expression ϕ if for all x1, ..., xn, size_g ⟦e1⟧ ... ⟦en⟧ ≤ [(size_g x1 ... xn)/u]ϕ,
where the x_i’s are the arguments of g and the ⟦e_i⟧’s are the evaluations of e_i
on input x1 ... xn. (See Sect. 3.2 for the formal definition of ⟦·⟧.) Note that one
annotation can contain multiple recursive-call costs, which allows the function
to make recursive calls to sub-problems with different sizes. We often abbreviate
(τ, (O(1))) as τ and omit f in recursive-call costs if it is clear from context.

A resource bound O(ψ) of a non-arrow type specifies the bound of the
resource usage strictly within the top-level-function body. A resource bound
in a signature of an auxiliary function f specifies the resource usage of f. Bound
expressions ψ in O(ψ) are of the form $u^a \log^b u + c$, where a, b, and c are all
non-negative constants, and u represents the size of the top-level problem.


Example 1. In the function prod (Eq. (2)), the recursive-call term prod (div2
x) y satisfies the recursive-call cost [1, ⌊u/2⌋], because size_prod = λz.λw.z, and


size_prod ⟦div2 x⟧ ⟦y⟧ = ⟦div2 x⟧ = ⌊x/2⌋ = [(size_prod x y)/u]⌊u/2⌋.
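
The condition in Example 1 can be illustrated by a bounded test in Python. This is only a sketch: SYNPLEXITY establishes the inequality as a validity query, whereas the code below merely samples inputs. The helper satisfies and the lambda encodings of call arguments and size expressions are hypothetical names introduced here.

```python
def size_prod(x, y):
    """Size function for prod: the value of the first argument."""
    return x


def satisfies(call_args, size, size_expr, inputs):
    """Bounded test that size(call args) <= size_expr(size(inputs))."""
    return all(
        size(*call_args(*args)) <= size_expr(size(*args)) for args in inputs
    )


inputs = [(x, y) for x in range(1, 40) for y in range(4)]
half = lambda u: u // 2  # the size expression floor(u/2)

# prod (div2 x) y from Eq. (2) stays within floor(u/2).
ok_div2 = satisfies(lambda x, y: (x // 2, y), size_prod, half, inputs)
# prod (dec x) y from Eq. (1) does not, e.g., for x = 3.
ok_dec = satisfies(lambda x, y: (x - 1, y), size_prod, half, inputs)
print(ok_div2, ok_dec)
```

A sampled check like this can only refute a size expression; it cannot prove that the inequality holds for all inputs.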


3.2 Semantics and Cost Model 


We introduce the concrete-cost semantics of our language here. The semantics 
serves two goals: (1) it defines the evaluation of terms (i.e., how to obtain values), 
which can be used to compute the sizes of problems in application expressions, 
and (2) it defines the resource usages of terms. 

Besides the syntax shown in Fig. 1, implementations of auxiliary functions 
can contain calls to a tick function tick(c,t), which specifies that c units of a 
resource are used, and the overall value is the value of t. Note that in our synthesis 
language, we are not actually synthesizing programs with tick functions. We 
assume that tick functions are only called in the implementations of auxiliary 
functions. In the concrete-cost semantics, a configuration ⟨t, C⟩ consists of a term
t and a nonnegative integer C denoting the resource usage so far. The evaluation
judgment ⟨t, C⟩ ↪ ⟨t′, C + C_Δ⟩ states that a term t can be evaluated in one step
to a term (or a value) t′, with resource usage C_Δ. We write ⟨t, C⟩ ↪* ⟨t′, C + C_Δ⟩
to indicate the reduction from t to t′ in zero or more steps. All of the evaluation
judgments are standard. Here we show the judgment of the tick function, where
resource usage happens.


$$\frac{}{\langle \mathsf{tick}(c, t),\ C \rangle \hookrightarrow \langle t,\ C + c \rangle}\ \text{(SEM-T)}$$


For a term t, ⟦t⟧ denotes the evaluation result of t, i.e., ⟨t, ·⟩ ↪* ⟨⟦t⟧, ·⟩.
Example 2. Consider the following function that doubles its input. 


fix double.λx.if x = 0 then 0 else tick(1, 2 + double(x-1)).

Let t_body denote the function body if x = 0 then 0 else tick(1, 2 + double(x-1)).
The result of evaluating double on input 5 is 10, with resource usage 5.

⟨(fix double.λx.t_body) 5, 0⟩
↪ ⟨if 5 = 0 then 0 else tick(1, 2 + double(4)), 0⟩
↪ ⟨if false then 0 else tick(1, 2 + double(4)), 0⟩
↪ ⟨tick(1, 2 + double(4)), 0⟩ ↪ ⟨2 + double(4), 1⟩
↪ ⟨2 + (fix double.λx.t_body) 4, 1⟩ ↪* ⟨4 + double(3), 2⟩ ↪* ⟨10 + double(0), 5⟩
↪ ⟨10 + (if 0 = 0 then 0 else tick(1, 2 + double(0-1))), 5⟩
↪ ⟨10 + (if true then 0 else tick(1, 2 + double(0-1))), 5⟩ ↪ ⟨10 + 0, 5⟩


With the standard concrete semantics, the complexity of a function f is 
characterized by its resource usage when the function is evaluated on inputs of 
a given size. 
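
The evaluation in Example 2 can be mimicked in Python with an explicit cost counter. The class Meter and its tick method are hypothetical names chosen for this sketch; they only imitate the effect of the tick judgment by charging c units before evaluating the delayed term.

```python
class Meter:
    """Accumulates resource usage, playing the role of C in a configuration."""

    def __init__(self):
        self.cost = 0

    def tick(self, c, thunk):
        """Charge c units, then evaluate the delayed term."""
        self.cost += c
        return thunk()


def double(x, meter):
    """The function of Example 2, with tick routed through the meter."""
    if x == 0:
        return 0
    return meter.tick(1, lambda: 2 + double(x - 1, meter))


meter = Meter()
result = double(5, meter)
print(result, meter.cost)
```

Running the sketch on input 5 yields the value 10 with accumulated cost 5, in agreement with the final configuration of the trace in Example 2.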


Definition 1 (Complexity). Given a function fix f.λȳ.t of type τ1 → τ2
with size function size_f : τ1 → ℕ, suppose that for any possible input x̄,
the configuration ⟨(fix f.λȳ.t) x̄, 0⟩ can be reduced to ⟨v, C_x̄⟩ for some value
v. Then, if T_f : ℕ → ℕ is a function such that, for all u > 0,
$T_f(u) = \sup_{\bar{x}\ \text{s.t.}\ \mathit{size}_f(\bar{x}) = u} C_{\bar{x}}$, we say that T_f is the complexity function of f.


Note that Definition 1 assumes that the top-level term (fix f.λȳ.t) x̄ can be
reduced to some value. Thus, Definition 1 only applies to terminating programs.


Definition 2 (Big-O notation). Given two integer functions f and g, we say
that f dominates g, i.e., $g \in O(f)$, if $\exists c, M > 0.\ \forall x > c.\ g(x) \le M f(x)$.


In the rest of the paper, we use T_f to denote the complexity function of the
function f, and we say the complexity of f is bounded by a function g if T_f ∈
O(g). As an example, the complexity of the double function shown in Example
2 is T_double(u) := u, and hence T_double(u) ∈ O(u).
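
Definitions 1 and 2 can be illustrated with a small Python sketch. It approximates the complexity function by taking the maximum cost over a finite sample of inputs, and it checks the big-O condition only up to a finite bound for given witnesses c and M. The names double_cost, complexity and dominated are assumptions of this sketch, and a bounded check is not a proof of membership in O(f).

```python
def double_cost(x):
    """Cost of Example 2's double: one tick per non-zero recursion level."""
    cost = 0
    while x != 0:
        cost, x = cost + 1, x - 1
    return cost


def complexity(cost, size, inputs):
    """Approximate T_f(u) as the maximum cost over sampled inputs of size u."""
    table = {}
    for x in inputs:
        u = size(x)
        table[u] = max(table.get(u, 0), cost(x))
    return table


def dominated(g, f, c, M, upto):
    """Check g(x) <= M * f(x) for all c < x <= upto (bounded witness check)."""
    return all(g(x) <= M * f(x) for x in range(c + 1, upto + 1))


T_double = complexity(double_cost, lambda x: x, range(100))
print(T_double[5], dominated(lambda u: T_double[u], lambda u: u, 0, 1, 99))
```

With the witnesses c = 0 and M = 1, the sampled T_double stays below u, which is consistent with T_double(u) ∈ O(u).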


Auxiliary functions. We allow users to supply signatures for auxiliary functions,
instead of implementations. It is an obligation on users that such signatures
be sensible; in particular, when the user gives the signature (τ1 →
{B | φ(ν, x̄)}, O(ψ(u))) for auxiliary function f, the user asserts that there exists
some implementation fix f.λȳ.t of f, such that: 1) for any input x̄, the output
of f on x̄ satisfies φ, i.e., φ(⟦(fix f.λȳ.t) x̄⟧, x̄) is valid; and 2) for any input
x̄, the complexity of f is bounded by ψ(u), i.e., T_f(u) ∈ O(ψ(u)). Signatures
always over-approximate their implementations, as illustrated by the following
example.


Example 3. The signature doubleRelaxed :: (x:Int → {Int | ν ≤ 3x}, O(u²))
describes an auxiliary function that computes no more than the input times
3, and has quadratic resource usage. Note that the function double shown in
Example 2 can be an implementation of this signature because ⟦double(x)⟧ =
2x ≤ 3x, and the complexity function T_double(u) = u is in O(u²).




3.3 Typing Rules 


The typing rules of SYNPLEXITY are inspired by bidirectional type checking [17] 
and type checking with cost sharing [16]. Recall that we use recFun to denote 
the name of the function for which we are performing type-checking, and args 
to denote the tuple of arguments to recFun. 

An environment Γ is a sequence of variable bindings of the form x : γ,
path conditions φ, and assignments of the form x = φ for recFun and the
components of args. SYNPLEXITY’s typing rules use three judgments: 1) Γ ⊢
t :: γ states that t has type γ, 2) Γ ⊢ γ1 <: γ2 states that γ1 is a subtype of γ2,
and 3) Γ ⊢ γ ⋎ γ1 | γ2 states that γ1 and γ2 share the costs in γ.


Subtyping. The subtyping relations between refinement types are relatively 
standard and can be found in the technical report [11]. The subtyping relations 
between annotated types allow us to compare resource consumption of recurrence 
annotations. The following is the rule for comparing recursive-call costs.


$$\frac{c' \ge c \qquad \Gamma \models \phi' \ge \phi}{\Gamma \vdash [c, \phi] <: [c', \phi']}\ \text{(<:-REC)}$$


For example, if one branch of some branching term has type (τ, ([1, ⌊u/3⌋], O(ψ))),
it can be over-approximated by a super type (τ, ([1, ⌊u/2⌋], O(ψ))). The idea is
that the resource usage of an application calling to a problem of size ⌊u/2⌋ will be
larger than the resource usage of the application calling to a smaller problem of
size ⌊u/3⌋ (assuming all resource usages are monotonic).

Subtyping rules also allow the type system to compare branches with a different
number of recursive calls. For example, base cases of recursive procedures
have no recursive calls, and thus have types of the form (τ, ([], O(ψ))).
With subtyping, these types can be over-approximated by types of the form
(τ, ([c, ϕ], O(ψ))).


Cost Sharing. When a term has more than one sub-term in the same path,
e.g., the condition guard and the then branch are in the same path in an ite
term, the recursive-call costs of the term will be shared among its sub-terms. The
sharing operator α ⋎ α1 | α2 partitions the recursive-call costs of α into α1 and
α2, i.e., the sum of the costs in α1 and α2 equals the cost in α. The following
is the sharing rule for a single recursive-call cost:


$$\frac{c_1, c_2 \ge 0 \qquad c_1 + c_2 \le c}{\Gamma \vdash [c, \phi] \curlyvee [c_1, \phi] \mid [c_2, \phi]}\ \text{(S-POT)}$$


Other sharing rules can be found in the technical report [11]. The idea is that
a single cost c can be shared as two costs c1 and c2 such that their sum is no
more than c. An annotation can be shared as two parts if every recursive cost
[c_i, ϕ_i] in it can be shared as two parts [c_i^1, ϕ_i] and [c_i^2, ϕ_i]. Finally, annotations
can also be shared as more than two parts.


Table 1. Annotations that can be used to instantiate the rule T-ABS.

               | Bound (B)  | Recurrence relation                                                  | Annotation (A)
Master Theorem | O(log u)   | $T(u) \le T(\lfloor u/d \rfloor) + O(1)$, d ≥ 2                      | ([1, ⌊u/d⌋]; O(1)), d ≥ 2
               | O(u log u) | $T(u) \le d\,T(\lfloor u/d \rfloor) + O(u)$, d ≥ 2                   | ([d, ⌊u/d⌋]; O(u)), d ≥ 2
Akra-Bazzi     | O(u log u) | $T(u) \le T(\lceil u/2 \rceil) + T(\lfloor u/2 \rfloor) + O(u)$      | ([1, ⌈u/2⌉], [1, ⌊u/2⌋]; O(u))
C-Finite Seq.  | O(u)       | $T(u) \le T(u - d) + O(1)$, d ≥ 1                                    | ([1, u − d]; O(1)), d ≥ 1
               | O(u²)      | $T(u) \le T(u - d) + O(u)$, d ≥ 1                                    | ([1, u − d]; O(u)), d ≥ 1


Example 4. There are multiple ways to share the recurrence annotation
([1, ⌊u/2⌋], [1, ⌈u/2⌉]; O(u)):


Γ ⊢ ([1, ⌊u/2⌋], [1, ⌈u/2⌉]; O(u)) ⋎ ([1, ⌊u/2⌋], [1, ⌈u/2⌉]; O(u)) | ([]; O(u)),


where one annotation contains both recursive-call costs [1, ⌊u/2⌋], [1, ⌈u/2⌉], and the
other contains no recursive-call cost. And


Γ ⊢ ([1, ⌊u/2⌋], [1, ⌈u/2⌉]; O(u)) ⋎ ([1, ⌊u/2⌋]; O(u)) | ([1, ⌈u/2⌉]; O(u)),


where each annotation contains one recursive-call cost.
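
The sharing of recursive-call costs can be enumerated with a short Python sketch. It follows the single-cost sharing rule (non-negative parts whose sum does not exceed the original count) and applies it independently to each cost of an annotation. The representation of a cost as a (count, size-label) pair and the function names are assumptions of this sketch; the bound component of the annotation is left out, and zero-count costs are kept rather than dropped.

```python
from itertools import product


def share_cost(c):
    """All (c1, c2) with c1, c2 >= 0 and c1 + c2 <= c."""
    return [(c1, c2) for c1 in range(c + 1) for c2 in range(c + 1 - c1)]


def share_annotation(costs):
    """Yield (left, right) splits of a list of (count, size_label) costs."""
    per_cost = [share_cost(c) for c, _ in costs]
    for choice in product(*per_cost):
        left = [(c1, s) for (c1, _), (_, s) in zip(choice, costs)]
        right = [(c2, s) for (_, c2), (_, s) in zip(choice, costs)]
        yield left, right


annotation = [(1, "floor(u/2)"), (1, "ceil(u/2)")]
for left, right in share_annotation(annotation):
    print(left, "|", right)
```

Both splits described in Example 4 appear among the enumerated pairs: one gives both costs to the left part, the other gives one cost to each part.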


Function Terms. The rule T-ABS shown below is really a rule-schema that is 
parameterized in terms of an annotation (A) for a function body t, and a resource 
bound (B) for the function term. If the function body t has some recurrence 
relation described by the annotation A, then the function f will satisfy the 
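resource-usage bound B. Some example patterns are shown in Table 1.³


$$\frac{\begin{array}{c}\Gamma' = [\mathit{recFun} \mapsto f][\mathit{args} \mapsto x_1 \ldots x_n]\,\Gamma \qquad \gamma_f = (x_1{:}\tau_1 \to \ldots \to x_n{:}\tau_n \to \tau,\ (B)) \\ \Gamma';\ f : \gamma_f;\ x_1 : \tau_1; \ldots; x_n : \tau_n \vdash t :: (\tau, (A))\end{array}}{\Gamma \vdash \mathsf{fix}\ f.\ \lambda x_1 \ldots \lambda x_n.\ t :: (x_1{:}\tau_1 \to \ldots \to x_n{:}\tau_n \to \tau,\ (B))}\ \text{(T-ABS)}$$


For example, if the annotation of the function body is ([1, ⌊u/2⌋]; O(1)), then the
resource bound in the function type will be O(log u), i.e., the resource usage of
f is bounded by O(log(size_f x1 ... xn)).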

At the same time, the rule stores the name f of the recursive function into 
recFun, and its arguments as a tuple into args. 
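
The role of the rule schema can be sketched as a lookup from a body annotation (A) to a bound (B), following the patterns of Table 1. The encoding below is hypothetical: a recursive-call cost is a triple (count, kind, d), where the kind "floor_div" or "ceil_div" stands for a sub-problem of size ⌊u/d⌋ or ⌈u/d⌉ and "sub" for size u − d. The quadratic entry is the bound implied by the recurrence T(u) ≤ T(u − d) + O(u).

```python
def bound_for(costs, body):
    """Map an annotation (costs; body) to a bound string, or None."""
    if len(costs) == 1:
        count, kind, d = costs[0]
        if kind == "floor_div" and d >= 2:
            if count == 1 and body == "O(1)":
                return "O(log u)"          # Master Theorem, first pattern
            if count == d and body == "O(u)":
                return "O(u log u)"        # Master Theorem, second pattern
        if kind == "sub" and d >= 1 and count == 1:
            # C-finite sequences: linear or quadratic depending on the body.
            return {"O(1)": "O(u)", "O(u)": "O(u^2)"}.get(body)
    halves = {(1, "floor_div", 2), (1, "ceil_div", 2)}
    if len(costs) == 2 and set(costs) == halves and body == "O(u)":
        return "O(u log u)"                # Akra-Bazzi pattern
    return None


print(bound_for([(1, "floor_div", 2)], "O(1)"))
print(bound_for([(1, "sub", 1)], "O(u)"))
```

Annotations that match no pattern return None, mirroring the fact that the type system only uses the templates it knows about.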


3 The patterns shown in Table 1 are those we used in the implementation. Patterns
capturing other recurrence relations can be added to the type system if needed.


Example 5. We use a function fix bar.λx.if x = 1 then 1 else 1+bar(div2 x)
to illustrate the first pattern in Table 1. The body of bar has the annotated type
([1, ⌊u/2⌋]; O(1)) because (i) there exists only one recursive call to a sub-problem
whose size is half of the top-level problem size u, and (ii) the resource usage
inside the body is constant (with the assumption that all auxiliary functions
have constant resource usage). This type appears in row 1, column 4 of Table 1.
Consequently, the recurrence relation of bar is $T(u) \le T(\lfloor u/2 \rfloor) + O(1)$ (row 1,
column 3), where T(u) is the resource usage of bar on problems with size u.
Finally, according to the Master Theorem, the resource usage of bar is bounded
by O(log u) (row 1, column 2).
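
The bound for bar can be observed directly with a Python sketch that counts recursive calls. The pair-returning convention is an assumption of this sketch, and inputs are assumed to be at least 1, since the base case is x = 1.

```python
def bar(x):
    """Example 5's bar for x >= 1; returns (value, recursive calls made)."""
    if x == 1:
        return 1, 0
    # One recursive call on a sub-problem of half the size.
    value, calls = bar(x // 2)
    return 1 + value, calls + 1


for x in (1, 2, 7, 1024):
    print(x, bar(x))
```

The number of recursive calls equals the number of halvings needed to reach 1, which is logarithmic in x.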


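Branching Terms. In rule T-IF, the condition has type Bool with refinement
φ_e. Two branches have different types: the then branch follows the path condition
φ_e and the refinement φ of the branch term, while the else branch follows
the path condition ¬φ_e. By having both branches share the same recurrence
annotation, T-IF can introduce some imprecision. In particular, if the branches
belong to different complexity classes, the annotation of the conditional term
will be the upper bound of both branches.


$$\frac{\begin{array}{c}\Gamma \vdash \alpha \curlyvee \alpha_1 \mid \alpha_2 \qquad \Gamma \vdash e :: (\{\mathsf{Bool} \mid \varphi_e\}, \alpha_1) \\ \Gamma, \varphi_e \vdash t_1 :: (\{B \mid \varphi\}, \alpha_2) \qquad \Gamma, \neg\varphi_e \vdash t_2 :: (\{B \mid \varphi\}, \alpha_2)\end{array}}{\Gamma \vdash \mathsf{if}\ e\ \mathsf{then}\ t_1\ \mathsf{else}\ t_2 :: (\{B \mid \varphi\}, \alpha)}\ \text{(T-IF)}$$


The rule T-MATCH is slightly different: (1) there can be more than two
branches, (2) all branches have the same type (τ, α2), and (3) variables in each
case C_i (x_i^1 ... x_i^n) are introduced in the corresponding branch.


$$\frac{\begin{array}{c}\Gamma \vdash \alpha \curlyvee \alpha_1 \mid \alpha_2 \qquad \Gamma \vdash e :: (\tau_s, \alpha_1) \\ C_i = \tau_i^1 \to \ldots \to \tau_i^n \to \tau_s \qquad \Gamma;\ x_i^1 : \tau_i^1; \ldots; x_i^n : \tau_i^n \vdash t_i :: (\tau, \alpha_2)\end{array}}{\Gamma \vdash \mathsf{match}\ e\ \mathsf{with}\ |_i\ C_i\,(x_i^1 \ldots x_i^n) \to t_i :: (\tau, \alpha)}\ \text{(T-MATCH)}$$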


E-terms. The typing rules for E-terms are shown in Fig. 3. The two rules
for application terms are the key rules of our type system. Let us first look
at the E-RECAPP rule for recursive-call terms. Recall that the recursive-call
annotation tracks the number of recursive calls and the sizes of sub-problems.
If the term f e1 ... en is a recursive call (i.e., Γ(recFun) = f), the number
of recursive calls in one of the recursive-call costs will increase by one, i.e.,
[c_k, ϕ_k] in the premise becomes [c_k + 1, ϕ_k] in the conclusion. Also, we want
to make sure that the size of the subproblem this application term is called
on satisfies the size expression ϕ_k. If each callee term is refined by the predicate
φ_i, i.e., Γ ⊢ e_i :: ({B_i | φ_i}, α_i), then the fact that the size of the
problem e1 ... en satisfies ϕ_k can be implied by the validity of the predicate
⋀_i [y_i/ν]φ_i ⟹ (size y1 ... ym ≤ [size Γ(args)/u]ϕ_k). We introduce validity
checking, written Γ ⊨ φ, to state that a predicate expression φ is always true
under any instance of the environment Γ.


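$$\frac{\Gamma \vdash e :: \gamma' \qquad \Gamma \vdash \gamma' <: \gamma}{\Gamma \vdash e :: \gamma}\ \text{(E-SUBTYPE)} \qquad\qquad \frac{\Gamma(x) = \gamma}{\Gamma \vdash x :: \gamma}\ \text{(E-VAR)}$$

$$\frac{\begin{array}{c}\Gamma(g) = (x_1{:}\tau_1 \to \ldots \to x_m{:}\tau_m \to \{B \mid \varphi\},\ (O(\psi_g))) \qquad \Gamma(\mathit{recFun}) \ne g \\ \Gamma \vdash ([c_1, \phi_1], \ldots, [c_n, \phi_n];\ O(\psi)) \curlyvee \alpha_1 \mid \ldots \mid \alpha_m \\ \forall\, 1 \le i \le m.\quad \Gamma \vdash e_i :: (\{B_i \mid \varphi_i\}, \alpha_i) \qquad \Gamma \vdash \{B_i \mid \varphi_i\} <: \tau_i \\ \Gamma \models \bigwedge_{i=1}^{m} [y_i/\nu]\varphi_i \implies ([\mathit{size}_g\ y_1 \ldots y_m/u]\psi_g \in O([\mathit{size}\ \Gamma(\mathit{args})/u]\psi)) \\ \tau = \{B \mid [z_i/x_i]\varphi \wedge \bigwedge_{i=1}^{m} [z_i/\nu]\varphi_i\} \qquad z_i \notin FV(\varphi_i),\ z_i \notin FV(\varphi)\end{array}}{\Gamma \vdash g\ e_1 \ldots e_m :: (\tau,\ ([c_1, \phi_1], \ldots, [c_n, \phi_n];\ O(\psi)))}\ \text{(E-APP)}$$

$$\frac{\begin{array}{c}\Gamma(f) = (x_1{:}\tau_1 \to \ldots \to x_m{:}\tau_m \to \{B \mid \varphi\},\ \alpha) \qquad \Gamma(\mathit{recFun}) = f \\ \Gamma \vdash ([c_1, \phi_1], \ldots, [c_k, \phi_k], \ldots, [c_n, \phi_n];\ O(\psi)) \curlyvee \alpha_1 \mid \ldots \mid \alpha_m \\ \forall\, 1 \le i \le m.\quad \Gamma \vdash e_i :: (\{B_i \mid \varphi_i\}, \alpha_i) \qquad \Gamma \vdash \{B_i \mid \varphi_i\} <: \tau_i \\ \Gamma \models \bigwedge_{i=1}^{m} [y_i/\nu]\varphi_i \implies (\mathit{size}\ y_1 \ldots y_m \le [\mathit{size}\ \Gamma(\mathit{args})/u]\phi_k) \\ \tau = \{B \mid [z_i/x_i]\varphi \wedge \bigwedge_{i=1}^{m} [z_i/\nu]\varphi_i\} \qquad z_i \notin FV(\varphi),\ z_i \notin FV(\varphi_i)\end{array}}{\Gamma \vdash f\ e_1 \ldots e_m :: (\tau,\ ([c_1, \phi_1], \ldots, [c_k + 1, \phi_k], \ldots, [c_n, \phi_n];\ O(\psi)))}\ \text{(E-RECAPP)}$$


Fig. 3. Typing rules of E-terms


Example 6. Recall Eq. (2). According to the rule E-RECAPP, the recursive call
prod (div2 x) y has type ({Int | ν = ⌊x/2⌋ * y}, ([1, ⌊u/2⌋]; O(1))). Note that the
first argument (div2 x) has type {Int | ν = ⌊x/2⌋}, the second argument y has
type {Int | ν = y}, the size function is size_prod = λz.λw.z, and the arguments
in the context are Γ(args) = x y. Therefore, the following predicate is valid:


[y1/ν](ν = ⌊x/2⌋) ∧ [y2/ν](ν = y) ⟹ size_prod y1 y2 ≤ [size_prod Γ(args)/u]⌊u/2⌋

⟺ (y1 = ⌊x/2⌋) ∧ (y2 = y) ⟹ y1 ≤ ⌊x/2⌋.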


The rule E-APP states that callees have types τ_i, and the resource
usage does not exceed the bound O(ψ) in the annotation. Similar to the
E-RECAPP rule, the size of the problem g calls to is [size_g y1 ... ym/u]
with the premise ⋀_{i=1}^m [y_i/ν]φ_i. The validity check ⋀_{i=1}^m [y_i/ν]φ_i ⟹
([size_g y1 ... ym/u]ψ_g ∈ O([size Γ(args)/u]ψ)) in the rule states that for any
instance of Γ, the size of the problem in the application term is in the big-O
class O([size Γ(args)/u]ψ). Note that the membership of big-O classes can
be encoded as an SMT query. The query is non-linear, and hence undecidable in
general. However, we observed in our experiments that for many benchmarks
the query stays linear. Furthermore, even when the query is non-linear, existing
SMT solvers are capable of handling many such checks in practice.


3.4 Soundness


We assume that the resource-usage function ψ and the complexities T of each
function are all nonnegative and monotonic integer functions, i.e., both the input
and the output are integers. We show soundness of the type system with respect
to the resource model. The soundness theorem states that if we derive a bound
O(ψ) for a function f, then the complexity of f is bounded by ψ.


Theorem 1 (Soundness of type checking). Given a function fix
f.λx1 ... λxn.t and an environment Γ, if Γ ⊢ fix f.λx1 ... λxn.t :: (τ, O(ψ)),
then the complexity of f is bounded by ψ.


Our type system is incomplete with respect to resource usage. That is, there 
are functions in our programming language that are actually in a complexity 
class O(p(x)), but cannot be typed in our type system. The main reason why 
our type system is incomplete is that it ignores condition guards when building 
recurrence relations, and over-approximates if-then-else terms by choosing the 
largest complexity among all the paths including even unreachable ones. 


4 The SYNPLEXITY Synthesis Algorithm


In this section, we present the SYNPLEXITY synthesis algorithm, which uses 
annotated types to guide the search of terms of given types. 


4.1 Overview of the Synthesis Algorithm 


The algorithm takes as input a goal type f : (τ, O(ψ)), an environment Γ that
includes type information of auxiliary functions, and the size functions for f
and all auxiliary functions. The goal is to find a function term of type (τ, O(ψ)).

The algorithm uses the rules of the SYNPLEXITY type system to decompose
goal types into sub-goals, and then applies itself recursively on the sub-goals
to synthesize sub-terms. Concretely, given a goal γ, the algorithm tries all the
typing rules, where the type in the conclusion matches γ, to construct sub-goals:
for each sub-term t in the conclusion, there must be a judgment Γ ⊢ t :: γ′
in the premise; thus, we construct the sub-goal γ′, the desired type of t. For
each I-term rule, the type of each sub-term is always known, and thus a fixed 
set of sub-goals is generated. For each E-term rule, the algorithm enumerates 
E-terms up to a certain depth (the depth can be given as a parameter or it 
can automatically increase throughout the search). If the algorithm fails to solve 
some sub-goal using some E-term rule, it backtracks to an earlier choice point, 
and tries another rule. 

Because the top-level goal is always a function type, the algorithm always
starts by applying the rule T-ABS, which matches the resource bound O(ψ) using
Table 1 to infer a possible recurrence annotation for the type of the function body.
Also, T-ABS constructs a sub-goal type for the function body. In the rest of this
section, we assume that goals are not function types.


Algorithm 1: GENERATEE(Γ, γ, d)

Input: Context Γ, goal type γ = ({B | φ}, α), depth bound d
for t ← ENUMERATEE(Γ, d, B) do
    if CHECKE(t, Γ, γ) then return t
return ⊥


Synthesizing E-Terms. The algorithm for synthesizing E-terms is shown in 
Algorithm 1. It enumerates each E-term t—with depth up to d—that satisfies 
the base type B in the goal y := ({B | v},([c1, 41].-[en, bn]; O(w))) from the 
context I’. For each such E-term t, the algorithm checks whether t satisfies the 
goal type with a subroutine CHECKE, which operates as follows. 

When $t$ is a variable term, CHECKE checks the refined type of $t$ against the
goal. When $t$ is an application term, CHECKE first checks whether the total number of
recursive calls in $t$ exceeds the bound $\sum_i c_i$; if it does, the term $t$ is
rejected. Otherwise, CHECKE checks the sizes of sub-problems of recursive calls
in $t$. Formally, to check whether a recursive application term $f(t_1, .., t_m)$ is consistent
with some $[c_k, \phi_k]$, the algorithm queries the validity of the following predicate

$$\Big(\bigwedge_{i=1}^{m} [y_i/\nu]\varphi_i\Big) \implies \Big(\mathit{size}_f(y_1 .. y_m) = [\mathit{size}_f(\Gamma(\mathit{args}))/u]\phi_k\Big),$$

where the $y_i$'s are fresh variables, and the $\varphi_i$'s are the refinements of the terms
$t_i$. If the sizes of sub-problems are not consistent with the recursive-call costs
$[c_1, \phi_1] .. [c_n, \phi_n]$, the term $t$ is rejected. Note that one recursive call can possibly
satisfy more than one $[c_k, \phi_k]$. The algorithm enumerates all possible matches.
Finally, CHECKE checks the refined type of $t$ against the goal.
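
The enumerate-and-check loop of Algorithm 1, together with the first rejection test of CHECKE (the bound on the number of recursive calls), can be sketched as follows. This is a minimal Python sketch and not the SYNPLEXITY implementation: the tuple-based term representation, the `components` table, and all function names are assumptions made for illustration, and the SMT-backed size and refinement checks are replaced by a caller-supplied predicate.

```python
from itertools import product

# Hypothetical term representation: a variable is a str; an application is
# a tuple (function_name, arg_1, ..., arg_k).

def enumerate_e(variables, components, depth):
    """Yield every term of depth <= `depth`.
    `components` maps a function name to its arity (an assumed encoding)."""
    if depth <= 0:
        return
    yield from variables
    if depth >= 2:
        smaller = list(enumerate_e(variables, components, depth - 1))
        for name, arity in components.items():
            for args in product(smaller, repeat=arity):
                yield (name, *args)

def count_recursive_calls(term, rec_name):
    """Count the applications of `rec_name` inside `term`."""
    if isinstance(term, str):
        return 0
    head, *args = term
    return (head == rec_name) + sum(count_recursive_calls(a, rec_name) for a in args)

def generate_e(variables, components, depth, rec_name, call_budget, satisfies_goal):
    """Return the first enumerated term that passes both checks, else None."""
    for t in enumerate_e(variables, components, depth):
        if count_recursive_calls(t, rec_name) > call_budget:
            continue  # more recursive calls than the annotation allows
        if satisfies_goal(t):  # stand-in for the SMT-backed type check
            return t
    return None
```

With a purely semantic `satisfies_goal`, a term such as `("prod", "x", "y")` could be accepted even though its recursive call does not shrink the problem; this is the situation that the size check on recursive calls is meant to rule out.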

Checking the validity of auxiliary application terms is similar. CHECKE needs
to establish that the following predicate holds, which asserts that the resource
usage of an auxiliary function does not exceed the bound $O(\psi)$.

$$\Big(\bigwedge_{i=1}^{m} [y_i/\nu]\varphi_i\Big) \implies \Big([\mathit{size}_g(y_1 .. y_m)/u]\psi_g \in O([\mathit{size}(\Gamma(\mathit{args}))/u]\psi)\Big).$$

Recall that the above query is undecidable in general, and is checked with best
effort by an SMT solver in SYNPLEXITY.


Synthesizing I-Terms. Algorithm 2 shows the algorithm for synthesizing I- 
Terms. GENERATEI first tries to synthesize an E-term for the goal y (line (1)). 

If there is no E-term that satisfies the goal, and the match bound $m$ is greater
than 0, GENERATEI chooses to apply the rule T-MATCH (lines (2)-(8)). First, it
enumerates candidate scrutinees $s$, which are E-terms of some data type. Then
it generates match patterns according to the type of $s$ (line (3)), updates the
goal with a new recursive-call cost (line (4)), and generates case terms $t_i$ for
each pattern patterns[i] (lines (5)-(7)). The subroutine UPDATECOST is used to
subtract the recursive-call cost usage from the cost in $\gamma$. Finally, if all case terms
are found, the algorithm constructs the corresponding match-term and returns
it.

If there is no match-term satisfying the goal, GENERATEI applies the rule
T-IF to synthesize a term of the form if cond then $t_T$ else $t_F$, and performs
three steps to construct sub-goals for the sub-terms cond, $t_T$, and $t_F$: (1) it enumerates
the condition guard cond (line (10)) of type bool; (2) it updates the cost in
the goal $\gamma$ (line (11)); and (3) it propagates sub-goals to the two branches $t_T$ and
$t_F$ with cond and $\neg$cond as the path condition (lines (12) and (13)), respectively.
Finally, if both $t_T$ and $t_F$ are found, the algorithm constructs the corresponding
if-term and returns it as a solution (line (14)).


Optimization. Algorithm 2 discussed above is based on bidirectional type- 
guided synthesis with liquid types (SYNQUID [18]). Therefore, liquid abduction 
and match abduction, two optimizations used in SYNQUID, can also be used in 
SYNPLEXITY. These two techniques allow one to synthesize the branches of if- 
and match-terms, and then use logical abduction to infer the weakest assumption 
under which the branch fulfills the goal type. 


Algorithm 2: GENERATEI(Γ, γ, d, m)

Input : Context Γ, goal type γ, depth bound d, match bound m
1  if t ← GENERATEE(Γ, γ, d) then return t
2  if m > 0 then for s ← ENUMERATEE(Γ, d, dataType) do
3    patterns ← GENERATEPATTERNS(Γ, TypeOf(s))
4    γ′ ← UPDATECOST(s, γ)
5    for i ∈ [1, SIZE(patterns)] do
6      t_i ← GENERATEI(UPDATECONTEXT(Γ, s == patterns[i]), γ′, d, m − 1)
7      if t_i == ⊥ then return ⊥
8    return Match s with |_i patterns[i] → t_i
10 for cond ← ENUMERATEE(Γ, d, Bool) do
11   γ′ ← UPDATECOST(s, γ)
12   t_T ← GENERATEI(UPDATECONTEXT(Γ, cond), γ′, d, m)
13   t_F ← GENERATEI(UPDATECONTEXT(Γ, ¬cond), γ′, d, m)
14   if t_T ≠ ⊥ ∧ t_F ≠ ⊥ then return If cond then t_T else t_F
15 return ⊥


Example 7. We illustrate in Fig. 4 how the algorithm synthesizes the $O(\log x)$
implementation of prod presented in Eq. (2). We omit the type contexts in the
example. We will use "??" to denote intermediate terms being synthesized (i.e.,
holes in the program). At the beginning, the type of ??1 (i.e., the term we are
synthesizing) is an arrow type with resource bound $O(\log u)$ specified by the
input goal. In this example, SYNPLEXITY applies to the arrow type the rule
T-ABS, parameterized according to the first rule in Table 1. This step produces
the sub-problem of synthesizing the function body ??2, whose annotation is
$([1, \lfloor u/2 \rfloor]; O(1))$, which means that ??2 should contain at most one recursive call
to sub-problems with size $\lfloor u/2 \rfloor$.

prod = ??1 : ⟨x:{Int | ν > 0} → y:{Int | ν > 0} → {Int | ν = x * y}, O(log u)⟩

prod = λx.λy. if ??3 ←(E-App) x == 0 : ⟨{Bool | x = 0}, ([0, ⌊u/2⌋]; O(1))⟩
              then ??4 : ⟨{Int | ν = x * y ∧ x = 0}, ([1, ⌊u/2⌋]; O(1))⟩
                   ←(E-App) x : ⟨{Int | ν = 0}, ([0, ⌊u/2⌋]; O(1))⟩
              else ??5 : ⟨{Int | ν = x * y ∧ x > 0}, ([1, ⌊u/2⌋]; O(1))⟩

??5 → if ??6 ←(E-App) even x : ⟨{Bool | x mod 2 = 0}, ([0, ⌊u/2⌋]; O(1))⟩
      then ??7 : ⟨{Int | ν = x * y ∧ x mod 2 = 0}, ([1, ⌊u/2⌋]; O(1))⟩
           ←(E-App) double (prod (div2 x) y)
      else ??8 : ⟨{Int | ν = x * y ∧ x mod 2 = 1}, ([1, ⌊u/2⌋]; O(1))⟩
           ←(E-App) plus y (double (prod (div2 x) y))

Fig. 4. Trace of the synthesis of an O(log x) implementation of prod.

Next, SYNPLEXITY chooses to fill ??2 with an if-then-else term (by
applying the T-IF rule) with three sub-problems: the condition guard ??3, the
then branch ??4, and the else branch ??5. Note that here we share the number
of recursive calls $[1, \lfloor u/2 \rfloor]$ as follows: 0 recursive calls in the condition guard,
and 1 in the then branch and the else branch. The left arrow labeled E-App shows
how SYNPLEXITY enumerates terms and checks them against the goal types of
sub-problems. For example, to fill ??4, SYNPLEXITY enumerates terms of type
$\langle \{Int \mid \nu = x * y \wedge x = 0\}, ([1, \lfloor u/2 \rfloor]; O(1)) \rangle$, which are restricted to contain at most
one recursive call to prod. In Fig. 4, SYNPLEXITY has picked the term x to fill
??4. The refinement type of the variable term x is $\{Int \mid \nu = x \wedge x = 0\}$, where
$x = 0$ is the path condition. To check that x also satisfies the type of ??4, the
algorithm needs to apply rule E-SUBTYPE, and check that, for any $\nu$ and $x$,
$\nu = x \wedge x = 0$ implies $\nu = x * y \wedge x = 0$, and $[0, \lfloor u/2 \rfloor]$ is approximated by $[1, \lfloor u/2 \rfloor]$.

After applying another T-IF rule for ??5, SYNPLEXITY produces three
new sub-problems ??6, ??7, and ??8. When enumerating terms to fill ??7,
SYNPLEXITY finds an application term double (prod (div2 x) y) that satisfies
the goal $\langle \{Int \mid \nu = x * y \wedge x \bmod 2 = 0\}, ([1, \lfloor u/2 \rfloor]; O(1)) \rangle$. To
check that the size of the problem in the recursive call prod (div2 x)
y satisfies the recursive-call cost $[1, \lfloor u/2 \rfloor]$, the type system first checks the
refinement of the callee. The refinement of the first argument (div2 x) is
$\varphi_1 := \nu = \lfloor x/2 \rfloor$. The refinement of the second argument y is $\varphi_2 := \nu = y$. Consequently,
the size of the sub-problem prod (div2 x) y satisfies $[1, \lfloor u/2 \rfloor]$ because
$[z/\nu]\varphi_1 \wedge [w/\nu]\varphi_2 \implies \mathit{size}\ z\ w = [(\mathit{size}\ x\ y)/u]\lfloor u/2 \rfloor$, which can be simplified
to $z = \lfloor x/2 \rfloor \wedge w = y \implies z = \lfloor x/2 \rfloor$. (Recall that the size function for prod is
$\mathit{size} := \lambda z.\lambda w.z$.)
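
The program traced in Fig. 4 can be transcribed directly into Python to see the logarithmic behavior. This is an illustrative transcription rather than SYNPLEXITY output: the optional `counter` argument is an addition used only to count calls, and the Python definitions of the auxiliary functions `div2`, `double`, and `plus` are assumed from their names.

```python
def div2(n):
    return n // 2

def double(n):
    return 2 * n

def plus(a, b):
    return a + b

def size(z, w):
    """Size function for prod: only the first argument counts."""
    return z

def prod(x, y, counter=None):
    """Multiply a non-negative x by y with one recursive call per level."""
    if counter is not None:
        counter[0] += 1  # instrumentation only, not part of the program
    if x == 0:
        return x
    if x % 2 == 0:
        return double(prod(div2(x), y, counter))
    return plus(y, double(prod(div2(x), y, counter)))

calls = [0]
result = prod(13, 7, calls)
print(result, calls[0])
```

Each call halves the first argument, so the sub-problem size is `size(div2(x), y) == size(x, y) // 2`, matching the recursive-call cost of one call on a problem of size $\lfloor u/2 \rfloor$.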


The algorithm is sound because it only enumerates well-typed terms. 


Theorem 2 (Soundness of the synthesis algorithm). Given a goal type
$\langle \tau, O(\psi) \rangle$ and an environment $\Gamma$, if a term $\mathit{fix}\ f.\lambda x_1 .. \lambda x_n.t$ is synthesized by
SYNPLEXITY, then the complexity of $f$ is bounded by $\psi$.


5 Extensions to the SYNPLEXITY Type System 


In this section, we introduce two extensions to the SYNPLEXITY type system. 


Recurrence Relations with Correlated Sizes. The type system shown in
Sect. 3 only tracks sub-problems with independent sizes. For example, consider
the recurrence relation $T(u) = T(l) + T(r) + O(1)$, where the variables $l$ and
$r$ are correlated by the constraint $l + r < u$. This relation is needed to reason
about programs that manipulate binary trees or binary heaps, where $l$ and
$r$ represent the sizes of the two children. To support such a recurrence relation,
we extend SYNPLEXITY's type system with recursive-call costs of the form
$[1, l], [1, u - 1 - l]$, where $l$ is a free variable. When correlated recurrence relations
are present, the synthesis algorithm will: (1) match the first enumerated
recursive-call term to $[1, l]$, and instantiate the size $l$ with $s$, where $s$ is the size
of the recursive-call term ($s$ should be smaller than the size $u$ of the top-level
function); and (2) use the size $s$ of the recursive-call term computed in step 1 to
constrain the algorithm to enumerate only recursive-call terms of sizes at most
$u - 1 - s$.
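
The two-step matching for correlated sizes can be illustrated on binary trees. The sketch below is a simplified Python model, not the SYNPLEXITY algorithm: the tuple encoding of trees and the function `check_correlated_calls` are assumptions, and sizes are given as concrete numbers rather than being derived from refinements by an SMT solver.

```python
def size(tree):
    """Number of nodes; a tree is None or a tuple (left, value, right)."""
    if tree is None:
        return 0
    left, _, right = tree
    return 1 + size(left) + size(right)

def check_correlated_calls(u, call_sizes):
    """Check recursive-call sizes against the costs [1, l], [1, u - 1 - l].

    Step 1: the first call instantiates l := s and must satisfy s < u.
    Step 2: any later call must have size at most u - 1 - s.
    """
    if len(call_sizes) > 2:
        return False  # at most one call per cost entry
    if not call_sizes:
        return True
    s = call_sizes[0]
    if not s < u:
        return False
    return all(r <= u - 1 - s for r in call_sizes[1:])

tree = ((None, 1, None), 2, ((None, 3, None), 4, None))
left, _, right = tree
print(check_correlated_calls(size(tree), [size(left), size(right)]))
```

Recursing on the two children of a node always passes this check, because the children together contain one node fewer than the parent.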


Synthesis of Auxiliary Functions. Most of the existing type-directed 
approaches require the input to the problem to contain all needed auxiliary 
functions. With SYNPLEXITY, some of the auxiliary functions needed to solve 
synthesis problems with resource annotations can be synthesized automatically. 
For example, consider the problem prod described in Sect. 2. In this problem,
we observe that one of the provided auxiliary functions, div2, strongly resembles
one of the elements of the recurrence relation, $T(u) \le T(\lfloor u/2 \rfloor) + O(1)$, needed to
synthesize a program with the desired resource usage. In particular, we know that
one needs an auxiliary function that can take an input of size $u$ and produce an
output of size $\lfloor u/2 \rfloor$. In this example, the required auxiliary function div2 merely
needs to divide the input by 2 (and round down), but in certain cases it might
need a more precise refinement type than merely changing the size of the input.
For example, the auxiliary function split used by merge sort needs to split the
input list xs into two lists $v_1$ and $v_2$ that are half the length of the input and
such that $elems(v_1) \uplus elems(v_2) = elems(xs)$. However, all we know from the
refinement is that the output lists must be half the length of the original list.
Although we do not know what this auxiliary function should do exactly, we
can use the size constraint appearing in the recurrence relation to define part
of the refinement type we want the auxiliary function to satisfy. SYNPLEXITY
builds on this idea and incorporates an (optionally enabled) algorithm,
SYNAUXREF, that, while trying to synthesize a solution to the top-level synthesis
problem, also tries in parallel to synthesize auxiliary functions that can
create sub-problems with the size constraints needed in the recurrence relation.
To address the problem mentioned above (i.e., that we do not know the exact
refinement type the auxiliary function should satisfy), SYNAUXREF enumerates
auxiliary refinements, which are possible specifications that the auxiliary function
aux we are trying to synthesize might satisfy.
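
To see why the size constraint alone underspecifies an auxiliary function such as split, consider the following Python sketch. It is an illustration under stated assumptions, not the SYNAUXREF algorithm: the two checker functions are hypothetical, "half the length" is read as at most the length rounded up to the next half, and the element constraint is read as multiset equality.

```python
from collections import Counter

def split(xs):
    """A split that satisfies both the size and the element constraint."""
    mid = len(xs) // 2
    return xs[:mid], xs[mid:]

def bad_split(xs):
    """Satisfies the size constraint but duplicates and drops elements."""
    return xs[: len(xs) // 2], xs[: (len(xs) + 1) // 2]

def satisfies_size_constraint(xs, v1, v2):
    # Size part of the specification, taken from the recurrence relation.
    half_up = (len(xs) + 1) // 2
    return len(v1) <= half_up and len(v2) <= half_up

def satisfies_elems_refinement(xs, v1, v2):
    # Candidate auxiliary refinement: the outputs partition the input.
    return Counter(v1) + Counter(v2) == Counter(xs)

xs = [3, 1, 2, 5, 4]
for candidate in (split, bad_split):
    v1, v2 = candidate(xs)
    print(candidate.__name__,
          satisfies_size_constraint(xs, v1, v2),
          satisfies_elems_refinement(xs, v1, v2))
```

Both candidates meet the size constraint, but only `split` also meets the element constraint, which is the kind of additional specification an enumerated auxiliary refinement would supply.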


Synthesis with Higher-Order Functions. Although SYNPLEXITY does not 
support higher-order functions in general, it can solve restricted but practical 
problems with higher-order functions. The supported restriction introduces four
assumptions on the synthesis problems. First, we assume that the resource usage
of any function argument $g$ is constant, i.e., $g : \langle \tau, O(1) \rangle$. Second, arrow-type
arguments in recursive calls in the synthesized program are the same as the
arrow-type arguments of the top-level function. For example, in the body of a
higher-order function $\mathit{fix}\ f.\lambda g.\lambda x.\lambda y.t$, all recursive application terms must be of
the form $f(g, \_, \_)$, where $\_$ can be any well-typed term. Third, we assume that the
sizes of outputs of functions as arguments do not affect the asymptotic resource
usage of the synthesized programs. Finally, arrow-type arguments cannot appear
in size functions.

We extend the syntax and the type system of SYNPLEXITY to support the 
restricted problems (the detail of this extension can be found in the technical 
report [11]). We also modify the synthesis algorithm to prune E-terms that break 
the second or third restriction mentioned above. 

To support the second restriction (i.e., that we need to call the same function 
arguments in recursive calls), the synthesis algorithm first stores the function 
arguments of the top-level functions. Later, when a recursive call is enumer- 
ated, the synthesizer checks whether the recursive call has the same function 
arguments, and rejects the candidate if it does not. 

To support the third restriction (i.e., that the behavior of function arguments 
should not affect the resource usage), the synthesis algorithm avoids enumerat- 
ing nested application terms where the resource usage of the outer application 
depends on the value of an inner application term that calls a function argument. 
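
The check for the second restriction can be sketched as a traversal of a candidate term. This is a hypothetical Python model, not the SYNPLEXITY implementation: terms are encoded as nested tuples `(function_name, arg_1, ..., arg_k)` with variables as strings, and `fun_params` (a map from argument position to the name of the top-level function parameter) is an assumed way of storing the function arguments of the top-level function.

```python
def violates_second_restriction(term, rec_name, fun_params):
    """Return True if some recursive call changes a function argument.

    term:       a str (variable) or a tuple (head, arg_1, ..., arg_k)
    rec_name:   name of the function being synthesized
    fun_params: {argument position: top-level function-parameter name}
    """
    if isinstance(term, str):
        return False
    head, *args = term
    if head == rec_name:
        for pos, name in fun_params.items():
            if pos >= len(args) or args[pos] != name:
                return True  # e.g. f(h, _, _) instead of f(g, _, _)
    return any(violates_second_restriction(a, rec_name, fun_params) for a in args)

ok = ("cons", ("g", "x"), ("f", "g", ("tail", "xs"), "y"))
bad = ("cons", ("g", "x"), ("f", "h", ("tail", "xs"), "y"))
print(violates_second_restriction(ok, "f", {0: "g"}))
print(violates_second_restriction(bad, "f", {0: "g"}))
```

A synthesizer following the description above would reject `bad` as soon as the recursive call is enumerated.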


6 Evaluation 


In this section, we evaluate the effectiveness and performance of SYNPLEXITY, 
and compare it to existing tools.4 We implemented SYNPLEXITY in Haskell on 
top of SYNQUID by extending its type system with recurrence annotations as 
presented in Sect. 3. The detailed results can be found in the technical report 
[11]. 


4 All the experiments were performed on an Intel Core i7 4.00 GHz CPU, with 8 GB 
of RAM. We used version 4.8.9 of Z3. The timeout for each benchmark was 10 min.

6.1 Comparison to Prior Tools


We compared SYNPLEXITY against two related tools: SYNQUID [18] and RESYN 
[16], which are also based on refinement types. 


Benchmarks. We considered a total of 77 synthesis problems: 56 synthesis 
problems from RESYN (each benchmark specifies a concrete linear-time resource 
annotation), 16 synthesis problems from SYNQUID (which do not include resource 
annotations) that are not included in RESYN, and 5 new synthesis problems 
involving non-linear resource annotations. In these synthesis problems, synthesis 
specifications and auxiliary functions are all given as refinement types. For 3 
of the new benchmarks, the auxiliary function required to split the input into 
smaller ones is not given—i.e., the synthesizer needs to identify it automatically. 

The three solvers (SYNPLEXITY, SYNQUID, and RESYN) have different fea- 
tures, and hence not all synthesis problems can be encoded as synthesis bench- 
marks for a single solver. In the rest of this section, we describe what benchmarks 
we considered for each tool, and how we modified the benchmarks when needed. 


SYNQUID: SYNQUID does not support resource bounds, so we encoded all 77 synthesis
problems as SYNQUID benchmarks by dropping the resource annotations.
SYNQUID returns the first program that meets the synthesis specification, and cannot
provide any guarantees about the resource usage of the returned program.
SYNQUID solved 75 benchmarks, taking 3.3s on average. For 10 benchmarks,
SYNQUID synthesized a non-optimal program, i.e., there exists another program
with better concrete resource usage. For example, on the RESYN-triple-2 benchmark
(where the input is a list xs), SYNQUID found a solution with resource usage
$O(|xs|^2)$, while both SYNPLEXITY and RESYN can synthesize a more efficient
implementation with resource usage $O(|xs|)$. The two benchmarks that SYNQUID
failed to solve include the new benchmark SYNPLEXITY-merge-sort'. In
this benchmark, the auxiliary function required to break the input into smaller
inputs is not given, without which the sizes of solutions become much larger.
Therefore, SYNQUID times out.


RESYN: We ran RESYN on the 56 RESYN benchmarks with the corresponding 
concrete resource bounds. We could not encode 16 problems because RESYN does
not support non-linear resource bounds, e.g., the bound $\log |y|$ in the AVL-insert
SYNQUID benchmark. RESYN solved all 56 benchmarks with an average
running time of 18.3s. 


SYNPLEXITY: We manually added resource usages and resource bounds to existing
problems to encode them for SYNPLEXITY. For SYNQUID benchmarks without
concrete resource bounds, we chose well-known time complexities as the
bounds, e.g., we added the resource bound $O(u \log u)$ to the Sort-merge-sort
problem. For the RESYN benchmarks, we translated the concrete resource usage
and resource bounds to the corresponding asymptotic ones, e.g., for the RESYN-common'
benchmark with the concrete resource bound $|ys| + |zs|$, we constructed
a SYNPLEXITY variant with the asymptotic bound $O(u)$ and a size function
$\lambda ys.\lambda zs.|ys| + |zs|$. We could not encode 3 synthesis problems as SYNPLEXITY
benchmarks: two of them involved higher-order functions that do not satisfy
the assumptions introduced in Sect. 5, and the other one has an exponential
resource-usage bound $O(2^u)$ (the Tree-create-balanced problem from SYNQUID).

SYNPLEXITY solved 73 benchmarks with an average running time of 8.1s.
Unlike SYNQUID, SYNPLEXITY guarantees that the synthesized program satisfies
the given resource bounds. After extending the implementation to support
the restrictions discussed in Sect. 5, SYNPLEXITY solved 5/6 benchmarks
with higher-order functions. For 10 benchmarks, SYNPLEXITY found programs
that had better resource usage than those synthesized by SYNQUID. Furthermore,
SYNPLEXITY can encode and solve 9 problems that RESYN could not
solve because the resource bounds involve logarithms. However, SYNPLEXITY
cannot encode and solve 2 benchmarks that involve higher-order functions and
do not satisfy the restrictions introduced in Sect. 5. SYNPLEXITY could solve 3
problems that required synthesizing both the main function (e.g., SYNPLEXITY-merge-sort)
and its auxiliary function (e.g., a function splitting a given list into
two balanced partitions). No other tool could solve the SYNPLEXITY-merge-sort'
benchmark.


Finding. SYNPLEXITY can express and solve 73/77 benchmarks. SYNPLEXITY
has comparable performance to existing tools, and can synthesize programs
with resource bounds that are not supported by prior tools.


6.2 Pruning the Search Space with Annotated Types 


SYNPLEXITY uses recurrence annotations to guide the search and avoids enu- 
merating terms that are guaranteed to not match the specified complexity. We 
compared the numbers of E-terms enumerated by SYNPLEXITY and SYNQUID
for the 56 benchmarks on which both tools produced the same solutions. SYNQUID always
enumerated at least as many E-terms as SYNPLEXITY, and SYNPLEXITY enu- 
merated strictly fewer E-terms for 26/56 benchmarks. For these 26 benchmarks, 
SYNPLEXITY can on average prune the search space by 6.2%. For example, in 
one case (BST-delete) SYNPLEXITY enumerated 2,059 E-terms, while SYNQUID 
enumerated 2,202. 
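
As a worked check of the BST-delete example, and assuming the reduction is measured relative to the number of E-terms enumerated by SYNQUID (the text does not state the denominator):

$$1 - \frac{2059}{2202} = \frac{143}{2202} \approx 0.065,$$

i.e., roughly a 6.5% reduction for this benchmark, close to the 6.2% average reported for the 26 benchmarks.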


Finding. On average, SYNPLEXITY reduces the size of the search space by 
6.2% for approximately half of the benchmarks. 


7 Related Work 


Resource-Bound Analysis. Rather than determining whether a given pro- 
gram satisfies a specification, a synthesizer determines whether there exists a 
program that inhabits a given specification. The branch of verification that we 
draw upon for resource-based synthesis is resource-bound analysis [20]. 

Within the literature on automated resource-bound analysis, there are methods
that extract and solve recurrence relations for imperative code [2,4,7,15].
However, these methods are unlike the type system presented in this work
because they extract concrete complexity bounds as recurrence relations, and 
then solve the recurrences to find a concrete upper bound on resource usage. 
The dominant terms of the resulting concrete bounds can then be used to state 
a big-O complexity bound. In contrast, we want to synthesize programs with 
respect to a big-O complexity directly, which is more similar to the manual rea- 
soning of [6,8]. Thus, if we were to use these techniques for our problem, the 
first step in our synthesis algorithm would be to pick a concrete complexity 
function given a big-O complexity, and then reverse the verification problem 
with regard to that concrete complexity. However, for any big-O complexity,
there are an infinite number of functions that satisfy that complexity, which 
presents a significant challenge at the outset. Our design choice also has some 
drawbacks. As noted in [8], reasoning compositionally with big-O complexity is 
challenging due to the hidden quantifier structure of big-O notation. Thus, to 
maintain soundness our type system has to sacrifice precision and generality in 
some places. For example, when a function has multiple paths, our type system 
over-approximates by choosing the largest complexity among all the paths. 

Another set of methods to generate resource bounds are type-based [9, 10, 14, 
19]. As we discussed throughout the paper, the complexities generated by these 
methods are concrete functions and not expressed with big-O notation, although 
[19] is sometimes able to pattern match a case of the Master Theorem. These 
type systems differ from ours in a few ways. The AARA line of research [9, 10, 14] 
is able to assign amortized complexity to programs, but is not able to generate 
logarithmic bounds. [19] is also able to perform amortized analysis; however, the 
technique is not fully automated, and instead requires the user to provide type 
annotations on terms, which are then checked by the type system. 


Type- and Resource-Aware Synthesis. The SYNPLEXITY implementation is 
built on top of SYNQUID [18], a type-directed synthesis tool based on refinement
types and polymorphism. The work that most closely resembles ours is RESYN 
[16]. As in our work, they combine the type-directed synthesizer SYNQUID with 
a type system that is able to assign complexity bounds to functional programs. 
The type system used in RESYN is based on one originally used in the context of 
verification [10]. That work uses a sophisticated type system to assign amortized 
resource-usage bounds to a given program. The type system of RESYN differs 
from the one presented in Sect.3 in a few significant ways. 

As highlighted earlier, RESYN automatically infers bounds on recursive func- 
tions using amortized analysis and is restricted to linear bounds, whereas our 
system is able to synthesize complexities of the form $O(n^a \log^b n + c)$.

Another difference is that RESYN synthesizes programs with a concrete com- 
plexity bound. This approach has advantages and disadvantages. For instance, it 
places an extra burden on the user to provide the correct bound with precise
coefficients. On the other hand, the user might want an implementation that has
a complexity with a small coefficient, whereas our system provides no guarantee 
that the complexity of an implementation will have a small coefficient in the 
dominant term: SYNPLEXITY only guarantees asymptotic behavior.

RESYN can synthesize programs with higher-order functions, which are supported
only in a restricted manner by SYNPLEXITY. To handle higher-order
functions, RESYN attaches resource units to types, which gives it resource poly- 
morphism. Moreover, costs of inputs with function types can be written generally 
as polymorphic types (i.e., costs can be polymorphic with respect to the size of 
the specific input types). SYNPLEXITY does not have asymptotic resource poly- 
morphism because it cannot directly compose unknown big-O functions (i.e., 
the complexity of higher-order inputs). We envision that with carefully crafted 
restrictions on the resource annotations of higher-order functions, SYNPLEXITY 
could handle synthesis problems involving such functions, e.g., assuming that the 
complexity of input functions is known and the refinements of input functions 
are precise enough. Detailed discussion about these restrictions can be found in 
Sect. 5 and the technical report [11]. Because big-O functions cannot be directly 
composed, developing a more general extension to SYNPLEXITY that supports 
higher-order functions is a challenging direction for future work.

Abstract. Sketch is a popular program synthesis tool that solves for
unknowns in a sketch or partial program. However, while Sketch is pow- 
erful, it does not directly support modular synthesis of dependencies, 
potentially limiting scalability. In this paper, we introduce Sketcham, 
a new technique that modularizes a regular sketch by automatically 
generating mocks—functions that approximate the behavior of complete 
implementations—from the sketch’s test suite. For example, if the func- 
tion f originally calls g, Sketcham creates a mock gm from g’s tests and 
augments the sketch with a version of f that calls gm. This change allows 
the unknowns in f and g to be solved separately, enabling modular syn- 
thesis with no extra work from the Sketch user. We evaluated Sketcham 
on ten benchmarks, performing enough runs to show at a 95% confidence 
level that Sketcham improves median synthesis performance on six of our 
ten benchmarks by a factor of up to 5x compared to plain Sketch, in- 
cluding one benchmark that times out on Sketch, while exhibiting similar 
performance on the remaining four. Our results show that Sketcham can 
achieve modular synthesis by automatically generating mocks from tests. 


Keywords: Program synthesis, mocks, Sketch 


1 Introduction 


Program synthesis by sketching, as embodied by the Sketch synthesis tool [80], 
is a popular technique that has been applied to a wide variety of problems 
[E7314] 1 5/1 6]18]22)29]. A Sketch input (henceforth a sketch) is a program 
written in a C-like language augmented with holes, unknown constants, and gen- 
erators, unknown expressions. The solution for a sketch is specified using test 
cases called harnesses, also written in the Sketch language, that make assertions 
about the results of to-be-synthesized code. Sketch searches for a solution using 
counterexample-guided inductive synthesis (CEGIS), which alternately synthesizes
a candidate solution and then uses a verifier to check the assertions; any
counterexamples from verification feed into the next round of synthesis [27]. 
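
The CEGIS loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Sketch is implemented: Sketch encodes both phases as SAT problems, whereas here both the synthesis phase and the verification phase are brute-force searches over small finite sets, and the function name `cegis` and its parameters are hypothetical.

```python
def cegis(candidates, spec, inputs, max_rounds=100):
    """Find a hole value c such that spec(c, x) holds for every x in inputs.

    candidates: re-iterable collection of possible hole values
    spec:       spec(c, x) -> bool, the assertion checked by the harness
    inputs:     finite list of inputs explored by the verifier
    """
    examples = []  # counterexamples collected so far
    for _ in range(max_rounds):
        # Synthesis phase: a candidate consistent with all counterexamples.
        cand = next((c for c in candidates
                     if all(spec(c, x) for x in examples)), None)
        if cand is None:
            return None  # no candidate fits the counterexamples
        # Verification phase: look for an input that breaks the candidate.
        cex = next((x for x in inputs if not spec(cand, x)), None)
        if cex is None:
            return cand  # verified on all inputs
        examples.append(cex)  # feed the counterexample into the next round
    return None

# Toy sketch: find the hole ?? such that x + x == ?? * x for all x.
hole = cegis(range(32), lambda c, x: x + x == c * x, list(range(-8, 8)))
print(hole)
```

Each round adds one counterexample, so the synthesis phase only ever reasons about a small set of concrete inputs rather than the whole input space.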


© The Author(s) 2021 
A. Silva and K. R. M. Leino (Eds.): CAV 2021, LNCS 12759, pp. 808-831, 2021. 
https://doi.org/10.1007/978-3-030-81685-8_38

One key challenge of using Sketch is that it does not specifically support
modular synthesis. More precisely, even if an input sketch is divided into a num- 
ber of functions that call each other, Sketch solves them all together. This ap- 
proach potentially limits scalability, as SAT formulas created by Sketch can grow 
quite quickly as function calls are inlined. A Sketch user could potentially work 
around this by manually replacing calls to to-be-synthesized functions with calls 
to Sketch models [24], which are mocks, i.e., functions that, in place of full imple- 
mentations, approximate the desired behavior with a specification in the form 
of assertions about individual cases. However, writing additional specifications 
is both time consuming and redundant with developing the original sketch. 


In this paper, we introduce Sketcham (short for Sketch and Mocks), a novel 
technique that converts a regular sketch problem into a modular sketch problem 
by automatically generating mocks from harnesses. More specifically, suppose 
Sketcham is given a sketch in which function f calls g and g is tested by harness h. 
Sketcham first converts h into a mock gm that has the same function signature as 
g but whose body encodes the assertions from h. Then, Sketcham augments the 
original sketch with new code in which f calls gm instead of g, thereby allowing 
f to be synthesized separately from g. Thus, by converting tests (harnesses) to 
mocks (specs), Sketcham enables modular synthesis without extra work from the 
user. Section 2 gives an overview of Sketcham.


Sketcham generates the new, modular sketch problem using a sequence of
three algorithms. First, Sketcham traverses the original sketch to build a mapping
A from function names to a set of assertions in which each function is called.
Note that we place some limitations on the assertions (e.g., they can contain at
most one function call) to guarantee we can always translate them from harness
assertions to mock assertions. Next, Sketcham traverses A, generating a mock
$f_m$ for each function $f \in dom(A)$, where $f_m$ encodes the assertions in $A(f)$. Finally,
Sketcham generates new mock harnesses that are the same as the original
harnesses, except they call mocks instead of the underlying functions. Section 3
presents Sketcham's core algorithms.


We implemented Sketcham as an additional pass to Sketch, which we evalu- 
ated on ten benchmarks. We found a high variance in running time, both under 
Sketch and under Sketcham. To account for this difference, we used the Clopper- 
Pearson method [6], running each configuration (synthesis tool-benchmark com- 
bination) up to 1,487 times, reaching 95% confidence that the true median run- 
ning time lies within 20% of the experimental median, excluding failures and 
runs exceeding a 60 minute timeout. We found that, for six of ten benchmarks, 
Sketcham runs up to 5x faster than Sketch; for one benchmark Sketcham is up 
to a factor of 0.98x slower; for the remaining three benchmarks, performance is 
indistinguishable. We examined one benchmark, deduplication of elements in an 
array, in detail. We found that the performance improvement is largely due to 
a mock that does a thorough job representing the function it mocks, and that 
the performance improvement occurs during the CEGIS synthesis phase rather 
than the CEGIS verification phase. Section 4 presents our evaluation.

 1 int[n] dedup(int n, int[n] vs,
 2              ref int sz) {
 3   int[n] svs = sort(n, vs); int[n] res;
 4   sz = ??;                               // 0
 5   for(int i=??; i<n; ++i) {              // 0
 6     int j = expr({sz,i}, {PL,MI});       // sz-1
 7     if(...) {                            // sz==0 || svs[i]>res[j]
 8       res[sz] = svs[i];
 9       ...
10       ...
11   return res;
12 }
13 int[n] sort(int n, int[n] vs) {
14   int m=..., r=..., i=..., j=...;
15   int[m] as = sort(vs[0::m]);
16   int[r] bs = sort(vs[m::r]);
17   while(exprBool({i, j, n}, {PL}))       // i+j<n
18     /* add as[i++] or bs[j++] to vs */
19   return vs;
20 }

(a) dedup and sort (simplified).

(b) Original harnesses.

(c) Mock harnesses.

1 harness void h_sort(int n,
2                     int[n] vs) {
3   int[n] svs = sort(vs);
4   for(int i=0; i<n-1; ++i)
5     assert svs[i] <= svs[i+1];
6   /* also elts(vs)=elts(svs) */
7
8 }

model int[n] sort_mock(int n,
                       int[n] vs) {
  int[n] svs = sort_uf(vs);
  for(int i=0; i<n-1; ++i)
    assume svs[i] <= svs[i+1];
  /* also elts(vs)=elts(svs) */
  return svs;
}

(d) Translating sort's test harness into a mock.

Fig. 1: Sketcham applied to deduplication via sorting.


In summary, Sketcham demonstrates that modular synthesis can be achieved
by automatically generating mocks from tests (specs from harnesses) without
additional user effort.


2 Overview 


To illustrate Sketcham, consider Figure 1a, which shows a simplified sketch
whose solution deduplicates an array of integers. This sketch makes use of Sketch
holes ??, which are unknown constants, and generators such as expr(vars, ops),
which is an unknown expression composed of variables vars combined with
operators ops, including PL for addition and MI for subtraction. The correct
solutions for the holes and generators are shown in end-of-line comments.

At the top of Figure 1a, function dedup takes a length n and array vs, and it
returns the deduplicated array and, by reference, the deduplicated array’s length
sz (in Sketch, functions can have at most one return value, hence the
return-by-reference sz). The dedup function begins by calling another function, sort,
to sort the array (line 3). Then it initializes sz to a hole and loops through
the array (lines 4-5). In each iteration, it computes an expression j of sz and i
(line 6), which is used in a conditional guard (line 7; details of guard not shown).
If the condition holds, the element at position i is copied into res and sz is
updated; otherwise the element is ignored. Finally, dedup returns the result array res.

The sort function (line 13) takes the length and array and returns a sorted
array. This particular sketch is for merge sort. Here the programmer knows
that merge sort involves sorting two sub-arrays but isn’t sure about the details.
After some initialization (not shown), it makes two recursive calls to sort
sub-arrays (lines 15 and 16). Then it loops over the sorted sub-arrays, merging the
elements into array vs, which is returned. The loop guard (line 17) uses a different
generator, exprBool(vars, ops), that generates arithmetic comparisons (<, <=,
etc.) among expressions generated by calling expr(vars, ops).


Harnesses and Mocks. To test the expected behavior of dedup and sort, the 
sketch also includes two harnesses, h_dedup and h_sort. Figure 1b shows the
call graph of the sketch with the harnesses, and the left side of Figure 1d shows
a portion of h_sort (we omit h_dedup for brevity). This harness calls sort and 
then makes assertions about the results, e.g., that the output array is sorted. 
Harnesses are distinguished from regular functions by the keyword harness, and 
their arguments are treated as universally quantified. Thus, h_sort tests that 
for all n and arrays vs of length n, the sort function is correct. 
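
The universally quantified reading of a harness can be approximated outside Sketch by bounded exhaustive testing. The following Python sketch is only an illustration and does not reproduce Sketch semantics: the names h_sort and check_harness, the extra length check, and the bounds on n and on element values are all assumptions made for this example, and Python's built-in sorted stands in for a solved sort.

```python
from itertools import product


def h_sort(sort, n, vs):
    """Python analogue of the h_sort harness; raises AssertionError on failure."""
    svs = sort(vs)
    assert len(svs) == n                      # assumed: output keeps the length
    for i in range(n - 1):
        assert svs[i] <= svs[i + 1]           # the output array is sorted
    assert sorted(svs) == sorted(vs)          # elts(vs) = elts(svs)


def check_harness(sort, max_n=3, values=range(3)):
    """Return a counterexample (n, vs) within the bounds, or None if none exists."""
    for n in range(max_n + 1):
        for vs in product(values, repeat=n):
            try:
                h_sort(sort, n, list(vs))
            except AssertionError:
                return n, list(vs)
    return None
```

Calling check_harness(sorted) finds no counterexample within the bounds, whereas a candidate that returns its input unchanged is rejected on the first unsorted array that the enumeration reaches.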

To solve this synthesis problem, Sketch converts dedup, sort, and a har- 
ness into a single SAT formula and then uses CEGIS to find a solution. This 
approach works, but the formula passed to the solver is large, because it con- 
tains both functions’ worth of code, and complex, because reasoning about the 
code in dedup requires simultaneously reasoning about the code in sort. Thus, 
mashing together both functions into a single SAT formula potentially limits the 
scalability of Sketch. 
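
To make the CEGIS loop concrete, here is a minimal Python sketch over a toy problem. It is not how Sketch is implemented: Sketch encodes the problem as a SAT formula, while this version simply enumerates candidate hole values. The function names, the toy sketch "return x * ??", and the finite ranges are hypothetical.

```python
def cegis(hole_values, inputs, spec):
    """Return a hole value c with spec(c, x) true for every x in inputs, or None.

    inputs must be re-iterable (for example a range or a list).
    """
    examples = []                       # counterexample inputs found so far
    for c in hole_values:
        # Synthesis step: skip candidates that already fail a known example.
        if not all(spec(c, x) for x in examples):
            continue
        # Verification step: search all inputs for a counterexample.
        cex = next((x for x in inputs if not spec(c, x)), None)
        if cex is None:
            return c
        examples.append(cex)
    return None


def toy_spec(c, x):
    # Toy sketch "return x * ??" with a harness asserting the result is x + x.
    return x * c == x + x
```

With hole values 0..7 and inputs 0..7, the first candidate 0 is refuted by the input 1, the candidate 1 is skipped because it fails that stored counterexample, and the candidate 2 passes verification.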

The key idea of Sketcham is to observe that this sketch is actually modular— 
it has been divided into two functions, each with their own tests. Sketcham takes 
advantage of this modularity by creating a new synthesis problem that includes 
mock versions of functions in the sketch, which can then be used to enable 
separate reasoning about each function. 

The right side of Figure 1d shows sort_mock, the mock version of sort.
The mock has the same signature as sort, but instead of containing the actual
sorting code, it contains assertions from h_sort about sort’s expected behavior.
In detail, in place of calling sort, the mock calls a fresh uninterpreted function
sort_uf on line 3. Then it makes assumptions (rather than assertions) about
the result array svs (line 5), and finally returns svs (line 7). The mock itself is a
Sketch model (indicated by the model keyword), and where the mock is called,
Sketch will replace the call with the assumptions in the model’s body [24].


int doub(int m) {
  return m * 2;
}
harness void h(int n) {
  int out = doub(n * 10);
  assert out == (n + n) * ??;
}

(a) Double.


model int doub_mock(int m) {
  int out = doub_uf(m);
  assume (0 == m%10) =>
    out == (m/10 + m/10) * ??;
  return out;
}

(b) Mock double.


Fig. 2: The double function and its mock.

Next, Sketcham creates new code that uses the mock, as shown in Figure 1c.
(Here the dashed, greyed boxes are for functions and harnesses that are generated
but do not improve solving time; see Section 4.2.) In particular, dedup' is the
same as dedup, except it calls sort_mock instead of sort, and h_dedup'' is the
same as h_dedup but it calls dedup' instead of dedup.

The final sketch includes h_dedup'', h_dedup' (a trivial harness that calls a 
mocked dedup), and h_dedup—in that order—as well as the harnesses for sort. 
Sketcham searches for a solution for each harness in order, i.e., it tries to solve 
h_dedup'' first. Notice that, critically, when Sketcham solves h_dedup'', it need 
not consider the code of sort, but rather only its specification as encoded in the 
mock. In practice, this means that Sketcham can solve h_dedup'' up to 18.1x 
faster than Sketch solves h_dedup, a significant speedup. 

Moreover, sort_mock encodes the specification of sort, so once Sketcham 
solves h_dedup'', it has found a solution for h_dedup as well. To preserve cor- 
rectness, Sketcham keeps the original harnesses such as h_dedup, because mocks 
with partial specifications can lead to partially incorrect solutions to the har- 
nesses using them. However, even in these cases, the counterexamples they gener- 
ate can still help more quickly narrow the synthesis search space for the original 
harness, and lead to an ultimately valid solution. 


Quantifier Elimination. In Figure 1d, the translation from harness to mock
was straightforward: the call to the mocked function becomes a call to an
uninterpreted function, and asserts become assumes. Sometimes, however, the
translation is more complex. Consider the sketch in Figure 2a, which includes a
function doub that doubles its input and a harness h that calls doub(n*10) and
asserts the result is (n+n)*?? for some hole.

Notice this assertion only describes arguments of the form n*10 for some n, 
i.e., implicitly there must exist some m such that m = n*10 for the assertion to 
hold. Sketcham performs quantifier elimination on such nested existentials,
following the approach of Kuncak et al. [17]. Figure 2b shows the resulting mock.


[Figure: the Sketch frontend translates the input program into Sketch IR. The
Sketch backend parses and optimizes the IR into a program p, Sketcham
(BUILDASSERTMAP, GENERATEMOCKS, MOCKHARNESSES) rewrites p into pm, and CEGIS
produces a hole assignment from which the frontend emits the output program.]


Fig. 3: Sketcham architecture


Here, in the assumption, n is replaced by witness candidate m/10. Because m is 
an integer, we also add a precondition that m is evenly divisible by 10. 
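
The effect of the witness m/10 and the divisibility precondition can be checked by brute force. The Python sketch below is an illustration under assumed finite bounds (the ranges N, M, and HOLES are hypothetical, and only non-negative integers are considered): it computes the hole values accepted by the harness assertion and the hole values for which the real doub satisfies the mock's assumption, and the two sets coincide.

```python
def doub(m):
    return m * 2


def harness_ok(c, n):
    # Harness h: assert doub(n * 10) == (n + n) * ??, with ?? replaced by c.
    return doub(n * 10) == (n + n) * c


def mock_assumption_ok(c, m, out):
    # Mock: assume (0 == m % 10) => out == (m/10 + m/10) * ??, with ?? as c.
    return m % 10 != 0 or out == (m // 10 + m // 10) * c


N = range(0, 8)        # assumed bound on the harness input n
M = range(0, 80)       # mock arguments, covering every n * 10 above
HOLES = range(0, 32)   # assumed range of hole values

from_harness = {c for c in HOLES if all(harness_ok(c, n) for n in N)}
from_mock = {c for c in HOLES
             if all(mock_assumption_ok(c, m, doub(m)) for m in M)}
```

Arguments m that are not multiples of 10 satisfy the implication vacuously, which is why the precondition is needed: without it, integer division would impose a constraint on arguments the harness never describes.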

We note that Sketcham includes quantifier elimination for completeness, and
in our evaluation we consider the sketch in Figure 2. However, we did not find
quantifier elimination necessary for our other benchmarks.


3 The Sketcham Algorithm 


Next we describe Sketcham more formally. It is implemented as a pass within
Sketch, as shown in Figure 3. The presentation that follows reflects this Sketch
implementation without loss of generality of the core algorithm for converting 
tests to mocks. The Sketch frontend consumes the input sketch and transforms 
it into the Sketch intermediate representation (Sketch IR), which is passed to the 
Sketch backend. Sketch IR encodes first-order logic augmented with theories of 
arithmetic, arrays, functions, and more, as discussed below. When the backend 
loads the IR, it performs loop unrolling, function inlining, and other transforma- 
tions that are needed by the solver [26], yielding a program p. Standard Sketch 
then uses CEGIS to solve the synthesis problem, outputting a hole assignment 
that the frontend uses to produce the solved sketch. Sketcham modifies this pro- 
cess by inserting, after optimization, a mock rewriting phase, described below, 
that transforms p into the augmented program pm for CEGIS. 

We formalize Sketcham on the fragment of Sketch IR shown in Figure 4. Here
types are omitted, and we assume the sketch is type-correct. A program sketch
p is a sequence of harness and function definitions. A harness definition h tags
a function definition as a test harness. A function definition d is given named
parameters¹ and a body, which is a sequence of statements. Statements s are


¹ For simplicity, we assume parameter names are unique across the whole program.


p       ::= (h | d)*
h       ::= harness d
d       ::= def f(x, ..., x) { s* }
s       ::= x := e | return e | assert φ | assume φ
e, φ, ψ ::= f(e, ..., e) | uop e | e bop e | n | x | ?? x
uop     ::= ¬ | −
bop     ::= ∧ | ∨ | ⊕ | = | ... | %

x, y ∈ variable names     G ∈ graphs        A : f → Φ
f, g ∈ function names     Φ ∈ set of φ      F : f → f


Fig. 4: Sketcham’s fragment of Sketch IR


assignments, returns, assertions, and assumptions. The most critical expressions
e in our algorithm are function calls f(e, ..., e) with their arguments. The
detailed grammar for the remaining expressions is unimportant in the remainder
of this section, but for completeness we show expressions for unary and binary
logical and arithmetic operations uop e and e bop e; constants n; variables x;
and named holes ?? x. Below, we sometimes use the metavariables φ and ψ in
place of e to indicate an expression used for Boolean-valued formulas.

Given the input Sketch IR program p as shown in Figure 3, Sketcham creates
the output sketch by first calling BUILDASSERTMAP (Algorithm 1) to build a
mapping A from function names to assertions from tests of those functions. Next,
GENERATEMOCKS (Algorithm 2) uses A to construct mocks for functions in the
domain of A, yielding program p’, which includes the original sketch p plus those
mocks. This step also returns a mapping F from the original function names to
the corresponding mock names. Finally, MOCKHARNESSES (Algorithm 3) creates
the output sketch pm, which augments p’ with copies of the original sketch’s
harnesses, except the copies call the mocks instead of the original functions.

Critically, during this last step, holes are not renamed when the harnesses are 
copied. Moreover, the newly generated harnesses are prepended to the sketch. 
Thus, when CEGIS tries solving each harness in pm in order, it will first find 
solutions that are consistent with the mocks. Then when it reaches the original 
harnesses (which must remain in case there is information in them not captured
by the mocks; see the discussion of GENERATEMOCKS below), CEGIS can use the
information it already derived from the mocks to find the ultimate solution to 
the original problem. 

In the remainder of this section, we describe each step of the algorithm in
detail. Below, we capitalize the names of sets of a given metavariable (e.g., Φ is
a set of formulas φ, etc.), and we use vector notation to indicate arrays (e.g., s⃗
is an array of statements s).


Building the assertion mapping. Each mock expresses the specification of an
original function as it is encoded by that function’s tests. To start, Sketcham
collects assertions from those tests into an assertion mapping. Algorithm 1 builds


Algorithm 1 Mock rewriting: building the assertion map

Input: p - the sketch
Output: A - finite map of function names to sets of assert formulas
 1: function BUILDASSERTMAP(p)
 2:   A ← ∅
 3:   Φ ← {φ | assert φ ∈ p}               ▷ all solver-reachable asserts in p
 4:   Φ0 ← {φ ∈ Φ | 0 = |f(...) ∈ φ|}      ▷ asserts with 0 function calls
 5:   Φ1 ← {φ ∈ Φ | 1 = |f(...) ∈ φ|}      ▷ asserts with 1 function call
 6:   for all f ∈ Φ1 do
 7:     Φf ← Φ0 ∪ {φ ∈ Φ1 | f ∈ φ}         ▷ asserts with 0 calls, or 1 call to f
 8:     Ψ ← Φ \ Φf
 9:     while Ψ ≠ ∅ do
10:       X ← FV(Ψ)                        ▷ inputs and holes free in Ψ
11:       Ψ ← {φ ∈ Φf | X ∩ FV(φ) ≠ ∅}
12:       Φf ← Φf \ Ψ
13:     end while
14:     A[f] ← Φf
15:   end for
16: end function


the assertion mapping A from the input sketch p. The algorithm begins by
initializing A to empty and Φ to the set of all assertions from all tests in p.
It then selects two subsets of Φ. The set Φ0 contains all assertions that do not
include calls to any functions, and the set Φ1 contains all assertions that include
exactly one function call. We exclude assertions with multiple function calls
so that mocks are standalone, to conform to the technical requirements Sketch
imposes on models. As a consequence, we exclude some terms that present no
such concerns (e.g., conjunctions of otherwise unrelated terms), as translating
them to assumptions may be much more complex or even impossible. We leave
extending BUILDASSERTMAP to more assertion patterns to future work.

For each function f called in an assertion in Φ1, on line 7 we next compute the
set Φf from Φ0 (the assertions that hold throughout each test, including at calls
to f) and the subset of Φ1 that refers to f. For example, consider the assertion
in h_sort in Figure 1d. This code refers to the result of calling sort(n, vs),
so Φ1 = {φi(sort(n, vs))}, where the φi capture the assertions in h_sort.
Additionally, if we picked, say, a loop unrolling bound of 4, then Sketch would
implicitly assert n<4, resulting in Φ0 = {n<4}. In general, Φ0 might contain
additional assertions that are irrelevant to the calls in Φ1. For example, loop
unrolling for harness h_dedup (not shown) might add another bound m<4 to Φ0
for sort. However, such irrelevant assertions will not change the resulting mock.

In some cases, we cannot add assertions in Φf to A because other assertions
on the same variables interfere. For example, suppose the sketch includes
assert f(x) and assert g(x). Then Φf might not completely characterize
f: the assertion in Φf is valid only if assert g(x) also holds, which puts an
unknown (until the full sketch is solved) constraint on x. Thus, in this case, our
algorithm discards the assertions in Φf. More specifically, on line 9, the loop


Algorithm 2 Mock rewriting: generate mocks

Input: p - the sketch
Input: A - output of Algorithm 1
Output: p’ - the sketch augmented with mock definitions
Output: F - finite mapping from an original function name to its mock
 1: function GENERATEMOCKS(p, A)
 2:   F ← ∅, p’ ← p
 3:   for all f ↦ Φ ∈ A do
 4:     def f(x⃗){...} ← the definition of f in p
 5:     fu ← FRESHNAME(f)
 6:     s⃗ ← []
 7:     Φ0 ← {φ ∈ Φ | 0 = |f(...) ∈ φ|}
 8:     Φ1 ← {φ ∈ Φ | 1 = |f(...) ∈ φ|}
 9:     for all φ ∈ Φ1 do                  ▷ convert asserts into assumes
10:       f(e⃗) ← the lone function call in φ
11:       φu ← φ[f(e⃗) := fu(x⃗)]            ▷ substitute uninterpreted function
12:       Ψ ← {xi = ei | 0 ≤ i < |x⃗|}      ▷ equate parameters to arguments
13:       φ’ ← (⋀Φ0) ∧ (⋀Ψ) ⇒ φu           ▷ the condition where φ holds
14:       φ” ← [FV(φ); Ψ ⊢ φ’]
15:       s⃗.append(assume φ”)
16:     end for
17:     fm ← FRESHNAME(f)
18:     F[f] ← fm
19:     dm ← def fm(x⃗){ s⃗; return fu(x⃗) }  ▷ create the mock definition
20:     p’ ← p’ extended with dm
21:   end for
22: end function

removes any φ ∈ Φf whose free variables overlap with free variables outside of
Φf. The process iterates in case free variable dependencies cascade. For example,
the existence of assert g(x) would eliminate assert f(x-y), which would in
turn eliminate assert f(y). The result is the transitive closure of the allowable
assertions about each function.
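
The cascading removal can be sketched in Python as follows. This is one reading of Algorithm 1 and not the authors' C++ implementation: the Assertion record, its fields, and the choice to precompute free variables by hand are assumptions made for illustration, and the initial set of interfering assertions is taken to be all assertions outside the candidate set for f.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    text: str              # printable form, for illustration only
    calls: tuple           # called function names, one entry per call
    free_vars: frozenset   # inputs and holes free in the formula


def build_assert_map(asserts):
    phi0 = {a for a in asserts if len(a.calls) == 0}
    phi1 = {a for a in asserts if len(a.calls) == 1}
    result = {}
    for f in sorted({a.calls[0] for a in phi1}):
        phi_f = phi0 | {a for a in phi1 if a.calls[0] == f}
        psi = set(asserts) - phi_f           # assertions outside the candidate set
        while psi:
            x = set().union(*(a.free_vars for a in psi))
            psi = {a for a in phi_f if x & a.free_vars}
            phi_f -= psi                     # drop assertions that interfere
        result[f] = phi_f
    return result


example = [
    Assertion("g(x)", ("g",), frozenset("x")),
    Assertion("f(x-y)", ("f",), frozenset("xy")),
    Assertion("f(y)", ("f",), frozenset("y")),
    Assertion("k(z)", ("k",), frozenset("z")),
    Assertion("n<4", (), frozenset("n")),
]
assert_map = build_assert_map(example)
```

On this input, g(x) removes f(x-y), which in turn removes f(y), so only the call-free bound n<4 remains for f. The hypothetical function k shares no variables with the other assertions and keeps its own assertion.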


Generate mocks. Next, Algorithm 2 iterates through each function in the domain
of A, generating a corresponding mock to add to the augmented sketch p’. As 
it does so, it also builds a map F from function names to the names of the 
generated mocks. 

For each f ↦ Φ ∈ A, GENERATEMOCKS begins by finding the definition of f
and creating a corresponding freshly named uninterpreted function fu. It then
initializes s⃗, the assumptions to be inserted into the new mock body, to empty.
Then, from each asserted formula φ ∈ Φ1, the algorithm creates a formula φu by
substituting the single function call f(e⃗) in φ with a call fu(x⃗), where x⃗ are the
formal parameters of f (line 11). Notice this call to fu is the same no matter the


Algorithm 3 Mock rewriting: mock harnesses

Input: p’ - the sketch from Algorithm 2
Input: F - the name map from Algorithm 2
Output: pm - the sketch augmented with mock harnesses
 1: function MOCKHARNESSES(p’, F1)
 2:   G ← CALLGRAPH(p’), pm ← p’
 3:   for i ← 1, maximum mock call graph depth do
 4:     Fi+1 ← ∅
 5:     for all def g(y⃗){s⃗} ∈ CALLERS(G, dom Fi) do      ▷ similarly, harness def
 6:       g’ ← FRESHNAME(g)
 7:       d’ ← def g’(y⃗){ s[f := f’ | f ↦ f’ ∈ Fi] | s ∈ s⃗ }   ▷ respectively, harness def
 8:       pm ← pm with d’ inserted before g
 9:       Fi+1[g] ← g’
10:     end for
11:   end for
12: end function

original call to f, which ensures the generated mock conforms to the technical
requirements Sketch imposes on models. To encode the actual information at
the call site, we next add a precondition. The algorithm constructs φ’ (line 13),
which is an implication denoting that φu holds if the ancillary asserts Φ0 and
the equalities xi = ei from the call to f hold. One nuance we elide here is that
Sketch augments all function calls with an additional explicit path condition
parameter that captures conditional branches taken up to the point of the call,
which makes it easier for Sketch to translate the IR into a SAT formula. For
soundness, we include this path condition as a premise of φ’ and assign fu the
path condition ⊤. Note that our implementation trims Φ0 before adding it to φ’
to the subset containing only the variables in e⃗.

Next, the algorithm performs quantifier elimination on φ’, yielding φ” (line 14).
More precisely, [FV(φ); Ψ ⊢ φ’] eliminates variables in FV(φ) from φ’, searching
for witnesses in Ψ. Then, φ” is added to s⃗ as an assume, and the loop continues
until all mappings for f have been handled.

Finally, on lines 17-19, the algorithm computes a fresh Sketch name fm for f,
adds a mapping to F, and creates function definition dm for fm. The function fm
takes the same arguments as f, assumes all formulas in s⃗, and returns fu on fm’s
arguments. Thus, when fm is called, the assertions about f from its original test
suite in p are assumed on fm’s arguments, as we saw in Section 2. The definition
dm is added to p’, and mock generation continues until all mappings in A have
been traversed.
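
The construction of the implication before quantifier elimination (lines 10-13 of Algorithm 2) can be illustrated with plain string rewriting. The Python sketch below is deliberately crude and hypothetical: Sketcham operates on Sketch IR rather than text, the helper name and its parameters are invented for this example, the textual replace assumes the call appears verbatim in the assertion, and the path condition is ignored.

```python
def mock_implication(assert_text, call_text, uf_name, params, args, ancillary=()):
    """Build the pre-elimination formula: (ancillary and equalities) => phi_u."""
    uf_call = f"{uf_name}({', '.join(params)})"
    # Substitute the lone call f(args) with the uninterpreted call fu(params).
    phi_u = assert_text.replace(call_text, uf_call)
    # Equate each formal parameter with the argument at the call site.
    equalities = [f"{x} == {e}" for x, e in zip(params, args)]
    premises = list(ancillary) + equalities
    if not premises:
        return phi_u
    return f"({' && '.join(premises)}) => ({phi_u})"


formula = mock_implication(
    assert_text="doub(n * 10) == (n + n) * ??",
    call_text="doub(n * 10)",
    uf_name="doub_uf",
    params=["m"],
    args=["n * 10"],
)
```

For the double example this yields an implication whose premise is m == n * 10. Quantifier elimination then replaces n with the witness m/10 and adds the divisibility precondition, which this sketch does not attempt.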


New mock harnesses. The last step of Sketcham adds calls to the mocks
generated by GENERATEMOCKS. One naive approach would be to simply replace each
call to f with a call to fm for all f ↦ fm ∈ F. However, this will not work for
two reasons. First, we need a full solution for the holes in all functions, including
those that are mocked. Replacing calls to f with calls to fm would remove many
constraints on the holes in f, underconstraining their solutions. Second, as we
saw earlier, the template for f might contain additional information excluded by
BUILDASSERTMAP, so replacing f by fm might underconstrain f’s callers.

Our solution is to create an output sketch that includes both the original 
sketch—including all calls to f in their original form—and duplicate sketch code 
that calls fm in place of f. The duplicated code refers to the same holes as 
the original sketch. Hence, information derived from the duplicated code can 
potentially greatly speed up solving of the original code. 

Algorithm 3 shows MOCKHARNESSES, which creates this duplicate code. The
algorithm begins by constructing a call graph G from the sketch p’ from the
previous step. Note that none of the mocks in p’ are called yet, so the call graph
is the same as for the original sketch. Next, the algorithm duplicates the sketch
one level of the call stack at a time, starting at the mocks and working up toward
the harnesses. To limit duplication, e.g., for mocks called by recursive functions
whose duplication would loop infinitely, the algorithm bounds the duplication
depth. For each level i, it iterates through all functions g ∈ CALLERS(G, dom Fi),
meaning functions g that call a function in the domain of Fi. It duplicates each
such g, replacing calls to functions f ∈ dom Fi with calls to Fi[f], and then adds
the duplicated function to the sketch. Since g has now been renamed, g ↦ g’ is
added to a new mapping Fi+1, and calls to it are duplicated in the next iteration,
repeating until reaching the root of the call graph or the maximum duplication
depth. Note the process is the same for both regular function definitions and for
functions that are harnesses.

For example, suppose harness h calls function g, which in turn calls function
f, and assume GENERATEMOCKS created fm and gm. Then in the first iteration,
MOCKHARNESSES creates a duplicate h’ that calls gm and a duplicate g’ that
calls fm. In the next and final iteration, it creates a duplicate h” that calls g’.
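
The level-by-level duplication in this example can be traced with a small Python sketch. It is an illustration under several assumptions and not the Sketch implementation: a program is modeled as an ordered list of (name, callees) pairs with function bodies reduced to their calls, fresh names are formed by appending a prime, and each duplicate is placed at the front of the program, which is one reading of "prepended to the sketch" that reproduces the h”, h’, h order described in Section 2.

```python
def mock_harnesses(program, mocks, max_depth=3):
    """program: list of (name, [callee names]); mocks: original name -> mock name."""
    graph = dict(program)                   # call graph of the input, built once
    out = list(program)
    used = set(graph) | set(mocks.values())
    level = dict(mocks)                     # the map for the current level
    for _ in range(max_depth):
        next_level = {}
        for g, callees in graph.items():
            if not any(c in level for c in callees):
                continue                    # g calls nothing renamed at this level
            fresh = g + "'"
            while fresh in used:
                fresh += "'"
            used.add(fresh)
            # Copy g, redirecting calls to their renamed versions.
            out.insert(0, (fresh, [level.get(c, c) for c in callees]))
            next_level[g] = fresh
        if not next_level:
            break
        level = next_level
    return out


program = [("f_m", []), ("g_m", []), ("f", []), ("g", ["f"]), ("h", ["g"])]
result = mock_harnesses(program, {"f": "f_m", "g": "g_m"})
```

Here f_m and g_m stand for the mocks fm and gm. The first pass adds g' (calling f_m) and h' (calling g_m), the second adds h'' (calling g'), and the third finds nothing left to duplicate.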

When we insert the duplicate functions, we insert them before the original 
functions. This ensures that when we insert the duplicate harnesses that call the 
mocks, Sketch will solve those harnesses before solving the original ones. 


4 Evaluation 


We evaluated Sketcham on ten benchmarks, running each from 11 to 1487 times
until reaching statistically significant results. We found that, for six of ten
benchmarks, Sketcham performs up to 5x faster than Sketch, for one benchmark
Sketcham is slower by a factor of up to 0.9x, and for the remaining three
benchmarks performance is indistinguishable. We examined the benchmark dedup
(Figure 1) in depth and found that, as suspected, the overall performance
improvement is due to improved synthesis time when using sort_mock.


Implementation. Sketcham comprises approximately 1075 lines of C++ code
within the Sketch backend. The user enables Sketcham with -mock and specifies
the max mock duplication depth via --bnd-mock-depth, which defaults to 3.
Because they clone and then rearrange the input Sketch IR program, the
run time of Algorithms 1-3 is approximately linear in the number of functions
and the number of asserts in the sketch. Our implementation covers the features
given as part of the Sketch IR fragment in Figure 4, with the modification that
we explicitly depict assignment, which Sketch IR does not require because it
structurally hashes expressions to yield a compact in-memory representation.
We also note that Sketch includes additional features that we leave to future
work, such as complex harness types, and that quantifier elimination is currently
restricted to arithmetic expressions.


Benchmarks. We used the following benchmarks: 


— double, the integer doubling program given in Figure 2.

— absval, the absolute value function. 

— fib, the linear-time Fibonacci function. The specification requires its output 
to be equivalent to the exponential time algorithm. 

— datetime, a simplified implementation of the C strptime function. This func- 
tion accepts a format that it uses to parse a date/time string. 

— boyerMoore, which implements the Boyer Moore string search algorithm [8]. 

— regex, a regular expression matching engine and compiler. 

— spellcheck, a program that suggests a corrected version of its input using the 
Levenshtein edit distance from entries in a dictionary. 

— minpair, uses edit distance to find the closest pair out of an array of values. 

— dedupm, deduplication with merge sort from Figure 1, and dedupi,
deduplication with insertion sort.


Sketch has a multitude of configuration options that can have a large effect on 
performance. The middle portion of Table 1 gives values for the four options that
differ across the benchmarks: int type, whether Sketch uses symbolic integers (in 
either a bit-vector encoding or a sparse encoding [26]) or native integers [28]; 
int bits, the number of bits per integer; loop unroll, the maximum loop unrolling 
depth; and func inline, the maximum depth of function call inlining. 

We selected values for these options that reflected each benchmark’s design 
and demonstrated pronounced run time differences from Sketch to Sketcham, as 
follows. double and absval use Sketch’s defaults. fib tests recursively computing 
the Fibonacci sequence up to the tenth entry, so function call inlining is set 
accordingly. regex is required to reject bad matches, which requires higher un- 
rolling and inlining. datetime, boyerMoore, spellcheck, and minpair need higher 
loop unrolling to iterate over long strings. These last three and both dedups also 
do much better using native integers. The dedups also run unreasonably slowly 
with more bits or higher unroll, so we reduced the amount of unrolling. In all 
our benchmarks, any configuration options not discussed here were left as their 
defaults, including the mock duplication depth, with the default of 3. 


Methodology. All measurements were taken on a 3.2 GHz AMD Ryzen 5 1600
system with 32GB of RAM. We found that while most benchmarks consistently


              #      #      int      int    loop    func     Sketch runs    Sketcham runs
            lines  holes    type     bits  unroll  inline   total  failed   total  failed

double          8      1  symbolic     5      8       5       17      0       17      0
absval         69      9  symbolic     5      8       5       17      0       17      0
fib            46      4  symbolic     6      8      10       20      0       65      0
datetime      177      3  symbolic    11     20       5       11     11       17      0
boyerMoore    136     16  native       7     13       5       17      0      153     19
regex         357   5604  symbolic     5     30       7       17      0       17      0
spellcheck     94      5  native       5      9       5       17      0       17      0
minpair       113      3  native       5     10       5       17      0       22      2
dedupi         73   1134  native       2      4       5     1487     88      762     23
dedupm         80   9008  native       2      4       5      648    281       88     16


Table 1: Benchmark config options and characteristics.


perform within half an order of magnitude under both Sketch and Sketcham,
in a few cases synthesis time varies by as much as two and a half orders of
magnitude. To account for this variance during our evaluation, we repeatedly
ran each benchmark until achieving statistical significance, between 11 and 1487
times, as listed in the rightmost portion of Table 1. Each run was executed with
the system otherwise almost totally idle to minimize interference. While most
runs completed successfully, we exclude those that exceed a 60 minute timeout
or fail to synthesize due to exhausting system memory or a crash within Sketch.
To give an idea of the problem size, the leftmost portion of Table 1 lists the
numbers of lines and holes per benchmark.


As other work has observed [9], performance evaluation methodologies that
lack rigor can lead to misleading and incorrect conclusions. To avoid this
problem, we collect enough data to calculate a percentile’s confidence interval (CI)
at a given confidence level (CL). We employ the classic Clopper-Pearson [6] (or
“exact”) method using the probabilities of the Binomial distribution to
iteratively calculate confidence intervals for a given dataset. While other methods
are often used, many of these assume an underlying Gaussian distribution. The
underlying distributions for our measurements are not known and do not appear
to be Gaussian, a case the exact method handles correctly.


Run time variance is not correlated across configurations, so the number of
runs needed for significance can differ from Sketch to Sketcham, as reflected
in the “total” columns of Table 1. We ran each configuration repeatedly until
measurements met two statistical significance conditions. First, they must reach
a 95% CL that the population median lies within at most a 20% CI around
the sample median. For example, for a sample median of 100s, the population
median might lie between 90s and 110s, or between 98s and 118s, depending
on the underlying distribution. Second, the CI must range entirely between the
first and third quartiles to increase the confidence that the median measurements
adequately reflect the underlying distribution. In seven out of ten benchmarks
these two conditions were sufficient to yield CIs that did not overlap across Sketch
and Sketcham, which allows for statistically significant performance claims about
these benchmarks.


Fig. 5: Total time (s). Times are drawn as notched box plots, which give the
distribution’s median inside a notch indicating its confidence interval. As usual,
the box extends to the first and third quartiles, and whiskers extend to the
full distribution. To better focus on the data, we truncate some whiskers. Note
differing y-axis scales both here and below.
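
As a rough illustration of the two significance conditions, the Python sketch below computes a distribution-free confidence interval for the median from order statistics and exact Binomial(n, 1/2) tail probabilities, then checks the width and quartile conditions. This is a standard textbook construction offered as an assumption about the general approach. It may differ in detail from the iterative procedure used for the paper's measurements, and the function names, the interpretation of "20% CI" as total width relative to the sample median, and the quartile method are choices made for this example.

```python
from math import comb
from statistics import median, quantiles


def median_ci(samples, confidence=0.95):
    """Order-statistic CI for the median using exact binomial tail probabilities."""
    xs = sorted(samples)
    n = len(xs)
    alpha = 1.0 - confidence
    tail = 0.0   # running value of P(B <= k) for B ~ Binomial(n, 1/2)
    rank = 0     # largest 1-based rank l with P(B <= l - 1) <= alpha / 2
    for k in range(n // 2 + 1):
        tail += comb(n, k) / 2 ** n
        if tail > alpha / 2:
            break
        rank = k + 1
    if rank == 0:
        raise ValueError("too few samples for the requested confidence")
    return xs[rank - 1], xs[n - rank]


def is_significant(samples, confidence=0.95, max_width=0.20):
    lo, hi = median_ci(samples, confidence)
    q1, _, q3 = quantiles(samples, n=4, method="inclusive")
    narrow = (hi - lo) <= max_width * median(samples)   # first condition
    inside = q1 <= lo and hi <= q3                      # second condition
    return narrow and inside
```

For 17 samples the interval runs from the 5th to the 13th smallest value, so a widely spread sample fails the width condition and more runs would be needed.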


4.1 Performance 


Figure 5 shows the running times of Sketch and Sketcham on our benchmarks.
The distribution of times is shown as notched box plots. The boxes extend from
the first to the third quartile, with the median shown as a mid-line. The CI is
indicated by the notch. The whiskers extend to the minimum and maximum
values (some whiskers are truncated to allow for a closer view of the median).
Following standard practice, we conclude that two configurations have a
statistically significant difference in performance if their CIs do not overlap, as there
is then high probability that the median times of the distributions are different.
We see that for six of the ten benchmarks, Sketcham is faster than Sketch, while
one is marginally slower and three display no significant performance change. We
investigated each benchmark’s performance in detail, discussed next. The
performance differences we report are ratios of the run time of Sketch to Sketcham
for a given benchmark. Due to uncertainty we report speedup ranges for the
median, comparing the opposite extents of each CI. This ranges from, at minimum,
the ratio of the faster end of Sketch’s CI to the slower end of Sketcham’s CI, up
to, at maximum, the ratio of the slower end of Sketch’s CI to the faster end of
Sketcham’s CI.
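
The speedup range described above is simple arithmetic on the two confidence intervals. The following Python sketch shows the calculation. The function name and the example intervals are hypothetical and are not measurements from the evaluation.

```python
def speedup_range(sketch_ci, sketcham_ci):
    """Return (min, max) speedup of Sketcham over Sketch from two median CIs.

    Each CI is a (fast_end, slow_end) pair of run times in seconds.
    """
    sketch_fast, sketch_slow = sketch_ci
    sketcham_fast, sketcham_slow = sketcham_ci
    minimum = sketch_fast / sketcham_slow   # least favorable pairing
    maximum = sketch_slow / sketcham_fast   # most favorable pairing
    return minimum, maximum


# Hypothetical intervals: Sketch median in [90s, 110s], Sketcham in [20s, 25s].
low, high = speedup_range((90.0, 110.0), (20.0, 25.0))
```

With these made-up intervals the reported range would be 3.6-5.5x.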

The times shown are total run time, which can be broken down into synthesis, 
verification, and overhead time. For Sketcham, overhead can further be broken 
down into mock construction and normal Sketch overhead. The total runtime 
overhead of mock construction is less than 0.4% for all benchmarks except regex 
(3%) and both dedups (~20%). In most cases, this time was dominated by the 
GENERATEMOCKS and BUILDASSERTMAP phases. 

The double benchmark’s performance is approximately the same in both 
cases. In fact, the CIs overlap almost completely, suggesting the performance 
may be dominated by constant factors in Sketch. 

The absval benchmark is also approximately the same. It is another simple 
program that Sketch solves very quickly, and as such the mocks only add to the 
verification time. 

The fib benchmark asserts that, on integers 0 to 9, the to-be-synthesized 
linear-time Fibonacci implementation returns the same result as an exponential- 
time implementation. In Sketch, the calls to the exponential-time algorithm 
cause a slowdown. But since Sketcham replaces calls to the exponential-time 
algorithm with calls to a (constant-time) mock, Sketcham achieves a speedup 
of 3.8-4.5x. While it is difficult to make out in the plot, the median and CI lie 
immediately above the first quartile for Sketcham. 
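
The shape of the fib specification can be pictured with the following Python analogue. It is only a sketch: the benchmark itself is written in Sketch with holes in the linear-time version, whereas here both functions are complete, and the names and the base cases fib(0)=0, fib(1)=1 are assumptions.

```python
def fib_exp(n):
    """Exponential-time reference used as the specification."""
    return n if n < 2 else fib_exp(n - 1) + fib_exp(n - 2)


def fib_lin(n):
    """Linear-time version, standing in for the synthesized implementation."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def h_fib(max_n=10):
    """Harness analogue: both versions must agree on the integers 0 to 9."""
    for n in range(max_n):
        assert fib_lin(n) == fib_exp(n)
```

Every harness check re-evaluates the exponential reference, which is the cost that a constant-time mock avoids.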

The datetime benchmark fails to synthesize in Sketch due to memory ex- 
haustion, but it consistently synthesizes in just a few seconds using Sketcham. 
Investigating further, we found the bottleneck is a function that parses strings 
into integers in a loop that converts digits and adds them to a running total. 
For example, the digit sequence abc is converted to the integer 100*a+10*b+c. 
This conversion loop is unrolled to the maximum bound by Sketch, and the in- 
put strings are of varying sizes, which is encoded as a separate formula for each 
possible length. The SAT conversion algorithm translates symbolic arithmetic 
formulas according to combinations of possible values of their subformulas, which 
results in very large SAT formulas in this case. Later in the conversion, these 
are merged back together in another quadratic operation. Due to the number 
of formulas and overall formula size, this eventually exhausts memory. While 
Sketcham technically faces the same issue, it does so after decomposing the 
sketch into smaller formulas, and thus these limits are never approached. 

The boyerMoore benchmark runs 4-5x faster under Sketcham than Sketch.
The reason is similar to the previous case. boyerMoore includes a generator that
constructs arithmetic expressions that add and subtract a small set of values
including a hole. Sketch constructs these expressions recursively, so they grow
quickly: the total number of terms is exponential in the degree of function
inlining, and the resulting expressions have high symmetry. Both factors slow
down solving, and the slowdown is compounded by the location of this expression
deep within the sketch. Because Sketcham breaks the problem’s dependencies,
this expression can be synthesized separately from the rest of the program, which
proceeds much more quickly.

The regex benchmark’s overall performance using Sketcham is slightly slower, at 
0.98x the speed of Sketch. The difference is statistically significant but minimal in practice. 
The main mocked function here performs compilation of a regular expression 
into instructions for a virtual machine. Because compilation is recursive, it is 
difficult to give a specification that Sketcham can use. It is instead given by 
example with an exhaustive set of subproblems, which greatly increases the 
number of harnesses to solve. While most harnesses keep similar performance 
and the slowest harness is 8% faster in Sketcham, this is not enough margin to 
improve overall solve time. 

The spellcheck benchmark using Sketcham sees a speedup of 1.5x, while 
minpair performs roughly the same (0.89-1.04x). Both rely on the same Lev- 
enshtein edit distance algorithm. The harness for this algorithm, which is the 
most time-consuming in either sketch, runs last in both settings, which reveals 
the source of the performance difference between the two benchmarks. minpair 
is dominated by synthesis time and spellcheck by verification time, which means 
that harnesses for the minimum pair function are more difficult to synthesize 
than for the spellcheck function, and so the former accumulates more state within 
the solver that is compounded when solving the Levenshtein harness. This slows 
it down enough to decrease the overall performance. On the other hand, the im- 
provement of spellcheck is distributed across all individual harnesses, and across 
both synthesis and verification time, more than making up for the time it takes 
to construct and solve the mock harnesses. 

Finally, the two dedup benchmarks show a notable performance improvement with 
Sketcham. In both dedup_i and dedup_m, the problem is large and complex enough 
that plain Sketch struggles with it. Sketcham eliminates the interactions of holes 
across the deduplication and sorting functions, which speeds up synthesis by a 
factor of 1.3-1.9x for dedup_i and 1.003-1.5x for dedup_m. 


4.2 Case Study: Deduplication 


Next, we examine the performance of dedup_i and dedup_m in detail, as they 
illustrate the strengths and weaknesses of Sketcham. We break our discussion 
into comparisons of solving time across harnesses and comparisons of CEGIS 
synthesis time to CEGIS verification time. 


Time to Solve Each Harness. Both dedup_i and dedup_m are structured the same 
way, and Sketcham creates the harnesses and mocks shown in Figure 1c for both. 
Figure 6 breaks down the total times for dedup_i and dedup_m, grouped by the 
harnesses for sort and for dedup. We exclude overheads such as time spent in 
mock construction, parsing the input, and reassembling the output. 

Fig. 6: Harness time (s) 

We make several observations. First, comparing the first and third columns 
within each subfigure, we see the time for solving h_sort is the same for Sketch 
and Sketcham. This makes sense because h_sort' adds no information—it calls 
mocked sort and then immediately asserts the same specification as in the mock. 
Note that, while the trivial h_sort' harness could be elided here, creating an 
analogous harness would be useful if the harness accidentally contained a con- 
tradiction. In such a case, Sketcham would almost instantly decide the harness 
is unsatisfiable, whereas Sketch could spend an arbitrary amount of time rea- 
soning about the computation in the actual called function before detecting the 
contradiction. 

Second, comparing the second and fourth columns within each subfigure, 
we see that the CI of h_dedup using Sketcham lies well below the CI using 
Sketch. The speed improves by a factor of 3.2-4.7x for dedup_i and 2.2-4.9x for 
dedup_m. Examining this result in detail, we find that Sketcham works exactly 
as intended: h_dedup'' calls the mocked sort, enabling it to synthesize quickly 
and assign holes correctly, which are then simply verified when checking h_dedup 
(and h_dedup' is trivial, similarly to h_sort'). 

Third, also comparing the second and fourth columns, we see the variance 
in performance for Sketch is much greater than for Sketcham. Investigating 
further, we found this occurs for two reasons. First, the specification in h_sort is 
weak enough² that sometimes an incorrect hole assignment for sort satisfies the 
verifier and is only discovered while synthesizing h_dedup, forcing the solver to 
backtrack at great cost and simultaneously consider the holes in both functions. 
Second, even when the solver finds a correct assignment for sort, it includes the 
entire formula again while solving h_dedup, resulting in a much larger problem 
and corresponding variability. In contrast, with Sketcham, h_dedup'' is decoupled 
from sort, eliminating these issues. 


2 In addition to the specification we have supplied, a complete specification of sort 
relies on the existence of a permutation function over the array’s indices. 

Fourth, we observe that both Sketch and Sketcham can solve h_sort about 
10x faster for dedup_i than for dedup_m. Overall, merge sort is more challenging for 
Sketch than insertion sort (note that since Sketch finitizes the problem by, e.g., 
unrolling loops, asymptotic complexity does not play a role). More surprisingly, 
synthesizing h_dedup is also faster for dedup_i compared to dedup_m. We believe 
this occurs because synthesis of h_dedup must sometimes recover from a bad 
hole assignment from h_sort, which will be quicker for dedup_i, and because the 
easier synthesis of dedup_i means the solver accumulates less state, such as conflict 
clauses, that would otherwise slow down solving subsequent harnesses. 

Finally, we begin to get a clearer picture of the divergence between dedup_i 
and dedup_m. In dedup_i, h_dedup synthesis is the performance driver, and the 
improvement using Sketcham has a significant impact on total performance 
improvement. In dedup_m, it is overshadowed by h_sort, which dominates to the 
point that improvement elsewhere is not as significant a contributor. Combined 
with the overhead of mock construction, this leads to a less pronounced improve- 
ment in total performance. 


Synthesis and Verification Time. Figure 7 shows the times for the CEGIS synthesis 
phase and verification phase for each benchmark under Sketch and Sketcham. 
Not shown are the overheads of mock construction, parsing, etc., which for dedup_i 
we found took 3-4s in Sketch versus 17-19s in Sketcham, and for dedup_m took 
90-96s in Sketch versus 201-207s using Sketcham. We believe much of the 
difference between these could be eliminated with additional engineering effort. 
Looking at verification times in Figures 7c and 7d, we see that while the 
verification times for Sketch and Sketcham are different, they are still relatively 
close: relative to Sketch, Sketcham runs at 0.81-0.86x the speed for dedup_i 
(slightly slower) and at 1.12-1.16x the speed for dedup_m (slightly faster). In 
contrast, comparing synthesis times in Figures 7a and 7b, we see a more 
significant speedup for Sketcham over Sketch: 1.59-2.55x for dedup_i and 1.28-2.28x 
for dedup_m. Moreover, if we compare synthesis and verification time, we see that 
the overall solving time for both benchmarks is dominated by synthesis time. 
Indeed, we observed even greater synthesis speedups on other benchmarks, 
including fib (4.2-5.1x) and boyerMoore (5.2-6.9x); the most extreme 
was spellcheck, which saw synthesis speed up by 308.4-345.7x using Sketcham. 
Thus, we find that Sketcham’s performance improvements come from reducing 
synthesis time by introducing mocks that decrease the number of holes that need 
to be considered at once. 


4.3 Discussion 


In general, we found that Sketch’s performance is unpredictable in practice 
and is influenced by factors such as the solver’s random seed. For example, 
in terms of overall solving time, our experimental runs included several outliers 
(not shown in Figure 5) near the 60 minute timeout. In these cases, Sketch 
essentially makes a very poor initial guess for the holes, and verification produces 
counterexamples that do not add much information. Both Sketch and Sketcham 
exhibit this issue. 

Moreover, often what seem like minor changes in the program sketch or con- 
figuration options can result in totally different solver behavior, and hence perfor- 
mance. One example of this was boyerMoore, which turned out to be non-linearly 
sensitive to the loop unrolling parameter. This benchmark was also extremely 
fickle about the problem formulation—holes in what seemed to be innocuous lo- 
cations would lead to timeouts in both Sketch and Sketcham. Another example 
is dedup, which initially had a specification that omitted a requirement that the 
output array did not have a negative length. Without this constraint, the per- 
formance benefit of Sketcham was overwhelmed by the variability of the solver 
exploring ultimately impossible scenarios. 

Overall, our results suggest that while Sketcham can’t always outperform 
plain Sketch, it performs best on problems split into functions whose tests cover 
the behavior the sketch actually relies on while being easier to compute than the 
functions’ actual implementations. While Sketcham affected the performance of 
both CEGIS phases, the best improvements were observed when the solving time 
of dependencies was dominated by the synthesis phase. For programs with these 
properties, Sketcham can exhibit a performance improvement of as much as 5x 
overall, with synthesis time improvements alone of up to 345.7x. Moreover, in 
some cases, such as datetime, Sketcham can solve problems that are out of reach 
of plain Sketch. For programs where these properties do not hold, Sketcham 
performance is typically similar to plain Sketch. 


5 Related Work 


There are several threads of related work. 

Program Synthesis with models. As discussed earlier, our work builds on 
work by Singh et al. [24], who propose manually created models for Sketch. 
While Sketcham relies on the core algorithm of that work, Sketcham frees the 
Sketch user from needing to write models, because we create mocks automat- 
ically from normal sketches. Mariano et al. use algebraic specifications to 
model libraries. In contrast, our approach derives specifications from the input 
program’s assertions, without requiring the programmer to add annotations. 


Deriving mocks and specs from tests. Saff et al. use the capture and re- 
play of actual test executions to automatically generate mock dependencies with 
the goal of speeding up test execution. Fazzini et al. [8] further generalize this 
capture-and-replay technique to consistently model the environment of a mo- 
bile app under test, allowing for testing apps that use an inconsistent resource 
like a database or network device. Both of these target normal testing rather 
than synthesis. Nguyen et al. leverage symbolic execution over input-output 
test pairs to perform program repair. However, they use these tests to model 
individual expressions instead of modeling entire functions. The insight 
underlying these approaches is similar to ours; however, Sketcham can handle both 
input-output pairs and general properties, and does not rely on either concrete 
or symbolic execution of tests. 


Component-based synthesis. Gulwani et al. [10] model programs using logical 
input-output relations to synthesize loop-free bit-vector programs. Shi et al. [23] 
combine many solutions that each only partially meet a specification into one 
that meets the entire specification. Both approaches limit the synthesis search 
space by building their solutions from the bottom up, from a selection of base 
components. Smith and Albarghouthi [25] prune the search space using bottom 
up algebraic rewriting of the program into an equivalent normal form. In contrast 
to these, Sketcham derives its benefits from breaking apart input sketches from 
the top down, at function level granularity. 


Modular synthesis using symbolic or actual execution. Samak et al. [22] de- 
rive specifications of class methods using symbolic execution and use them to 
synthesize a replacement shim class one method at a time. Van Geffen et al. 
use symbolic execution to model abstract virtual machines to modularly syn- 
thesize a compiler one instruction at a time. In contrast, because our approach 
derives mocks directly from the input’s assertions, we need not consider the code 
itself when modeling it. Hua et al. modularize the synthesis of library calls 
through execution of actual partial programs. In contrast, we attempt to avoid 
called functions entirely by relying on their inferred specifications. 


Other approaches. Bodík et al. [2] finalize incomplete programs using angelic 
nondeterminism. In contrast, Sketcham does not introduce arbitrary angelic val- 
ues, but instead constrains any angelic-like behavior using a function’s inferred 
specification. Huang et al. [12] use a divide-and-conquer strategy to iteratively 
split synthesis problems according to heuristics. In contrast, Sketcham splits 
problems structurally in a single pass. Polikarpova et al. [20] speed up synthesis 
through modular verification using refinement types. In contrast, our approach 
achieves a similar kind of modularity without being type-directed. 


6 Conclusion 


This paper presents Sketcham, a new technique for decomposing program 
sketches during synthesis by turning a function’s test suite into a mock that 
a caller can invoke in place of that function, thereby allowing separate reasoning 
about callers and callees. Sketcham gathers asserts from tests into a specifica- 
tion for each function which it embodies as a Sketch model. We implemented 
Sketcham as an additional pass with Sketch and evaluated it on a set of ten 
benchmarks. Our rigorous evaluation strategy ensured at a confidence level of 
95% that our measurements demonstrate performance gains of as much as 5x, 
including one benchmark that otherwise timed out on Sketch. Based on these re- 
sults, we believe that automatically generating mocks from tests with Sketcham 
is a promising new approach for achieving modular synthesis. 


Abstract. Quantifier bounding is a standard approach in inductive program 
synthesis in dealing with unbounded domains. In this paper, we 
propose one such bounding method for the synthesis of recursive func- 
tions over recursive input data types. The synthesis problem is specified 
by an input reference (recursive) function and a recursion skeleton. The 
goal is to synthesize a recursive function equivalent to the input func- 
tion whose recursion strategy is specified by the recursion skeleton. In 
this context, we illustrate that it is possible to selectively bound a subset 
of the (recursively typed) parameters, each by a suitable bound. The 
choices are guided by counterexamples. The evaluation of our strategy 
on a broad set of benchmarks shows that it succeeds in efficiently syn- 
thesizing non-trivial recursive functions where standard across-the-board 
bounding would fail. 


1 Introduction 


Most computational tasks can be broken into logical units, many of which involve 
evaluating a function over a data collection. Recursively defined data types are 
broadly used to implement these collections. In functional languages, recursive 
functions implement computations over these recursive data types. Consider a 
typical scenario where a programmer has implemented a function f over a 
collection C by defining a recursive data type A and implementing f as a recursive 
function foo_A. Later, the programmer may need a different implementation 
foo_B of f over a different data type B; perhaps B is better suited for an 
optimized implementation of f, or the programmer now needs an implementation 
of a new function g (in addition to f) over the collection C and the data type 
B is a much better choice than A for implementing g efficiently. Ideally, the 
programmer should not have to start from scratch implementing foo_B. 

In this paper, we propose a generic and efficient algorithm for synthesizing 
recursive functions in such contexts. Our synthesis problem is specified by the 
following three components: (1) a recursive reference implementation that pre- 
cisely defines the functionality, (2) a high level recursion skeleton that specifies 
a recursion strategy (i.e. a traversal plan over the new recursive data type) for 
the target code, and (3) a mapping, called representation function, that converts 
an instance of the new data type to one of the old data type (of the reference 
implementation), and establishes that the two are different implementations of 
the same concept. 

Let us illustrate our problem setup with the aid of an example. Consider the 
standard A-labelled binary trees, recursively defined as T → Nil | Node(A,T,T) 
for an arbitrary type A, and the maximum in-order prefix sum (mips) 
function shown below. mips maintains a pair of values: sum, which keeps 
track of the sum of the elements it has traversed so far, and mps, which 
maintains the maximum value over all such sums. This reference implementation 
precisely defines the functional specification for a function f. 

    let rec mips t = aux (0, 0) t
    and aux s t =
      match t with
      | Nil -> s
      | Node(a,l,r) ->
        let sum, mps = aux s l in
        aux (sum + a, max (sum + a) mps) r

Suppose that the programmer needs an alternative implementation that can 
be efficiently parallelized, and therefore opts for the divide-and-conquer 
recursion skeleton shown below. 

    let rec h t =
      match t with
      | Nil -> s0
      | Node(a,l,r) -> join a (h l) (h r)

The partially defined code specifies that the tree should be traversed in a manner that 
each subtree is processed separately, and then the results should be combined by 
a function join. It does not, however, specify what computation is performed; 
the implementation of join and the initial value for s0 are unknown. In this 
example, labeled binary trees are the recursive data type for both the reference 
implementation and the target of synthesis. In cases like this, the representation 
function simply becomes the identity function. 

Our algorithm reduces the problem to a set of recursion-free synthesis prob- 
lems, which are solved using existing synthesis tools. It synthesizes the unknown 
computations for join and s0, and therefore produces the divide-and-conquer 
implementation of mips on binary trees: 


    let s0 = (0, 0)
    let join a (s1, m1) (s2, m2) = a + s1 + s2, max m1 (m2 + a + s1)


Fig. 1. Maximum in-order prefix sum 
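
The relationship between the reference implementation and the synthesized divide-and-conquer version can be spot-checked directly. The following self-contained OCaml sketch restates both and compares them on a few hand-picked trees. It assumes integer labels (A = int), and the sample trees are our own choice; it is a sanity check on small inputs, not a proof of equivalence.

```ocaml
(* Binary trees labelled with integers (the label type is an assumption). *)
type tree = Nil | Node of int * tree * tree

(* Reference implementation: in-order traversal carrying (sum, mps). *)
let rec aux (sum, mps) t =
  match t with
  | Nil -> (sum, mps)
  | Node (a, l, r) ->
    let sum, mps = aux (sum, mps) l in
    aux (sum + a, max (sum + a) mps) r

let mips t = aux (0, 0) t

(* Divide-and-conquer skeleton instantiated with the synthesized s0 and join. *)
let s0 = (0, 0)
let join a (s1, m1) (s2, m2) = (a + s1 + s2, max m1 (m2 + a + s1))

let rec h t =
  match t with
  | Nil -> s0
  | Node (a, l, r) -> join a (h l) (h r)

(* Hand-picked sample trees, including negative labels. *)
let samples =
  [ Nil;
    Node (3, Nil, Nil);
    Node (-2, Node (5, Nil, Nil), Node (4, Nil, Nil));
    Node (1, Node (-7, Node (2, Nil, Nil), Nil),
             Node (6, Nil, Node (-1, Nil, Nil))) ]

(* True when both implementations agree on every sample. *)
let agree = List.for_all (fun t -> mips t = h t) samples
```

Note that h never threads an accumulator through the traversal: each subtree is summarized independently by a (sum, mps) pair, which is what makes the two recursive calls parallelizable.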


At the high level, the problem of synthesizing a new recursive function can be 
framed as checking the validity of formulas of the type $\exists f\, \forall x : \theta.\ \phi(f, x, \dots)$ where 
$\theta$ is a recursive data type (i.e. x ranges over a set of inductively defined terms), 
f is the target recursive function, and the ellipses stand in for all the relevant 
components of our specific problem statement as outlined before. Elements of 
type $\theta$ are unbounded in two different dimensions: the recursive structure can be 
of arbitrary size and each element of it belongs to an unbounded (data) domain. 
A straightforward way of under-approximating the unbounded specification is 
to bound the universal quantifier $\forall x : \theta$ in both dimensions. The synthesis 
problem is reformulated to synthesize the function from a bounded set of examples 
which are concrete bounded elements of the data type with concrete elements in 
them. This can be done by applying a counterexample-guided inductive synthesis 
(CEGIS) [34] algorithm in the straightforward way. 

Alternatively, one can attempt to tackle the two dimensions independently. 
The quantifier $\forall x : \theta$ can be bounded in one dimension, i.e. recursive structures of 
bounded size can be considered, and yet the elements of these bounded structures 
can range over unbounded domains. More formally, the universal quantification 
is instantiated over a finite set of bounded-depth terms, denoted by set T, and 
the resulting specification becomes $\exists f.\ \forall \vec{a} \in D.\ \bigwedge_{t \in T} \phi(f, t)$ where $\vec{a}$ are the 
free variables of the terms in T and of non-recursive type D. This bounding 
reduces the original problem to a standard functional synthesis problem (over 
unbounded data domains) that can be discharged to one of the many known 
solvers, employing a variety of techniques for it. The set of terms in T can 
still be discovered in a counterexample guided loop in the spirit of CEGIS, and 
therefore this algorithm can be viewed as a symbolic CEGIS variant. 

The thesis of this paper is that forcing bounds on all recursively typed vari- 
ables is unnecessary and can be avoided algorithmically. A subset of variables can 
retain their unbounded quantification and yet the problem can be reduced to a 
recursion-free functional synthesis instance. Recall the mips example. The join 
function takes two trees, l and r, and a value a as an input. The recursion-free 
specification for join can retain a universal quantifier on all trees for l and limit 
its bounded exploration to r. In other words, one can successfully synthesize the 
join function from examples that enumerate a few small candidate trees for r and 
that use h(l) (i.e. the result of the computation on l), rather than l itself, in the 
inductive enumeration of examples for synthesis. We discuss in the paper how 
this information can be algorithmically derived from the specific components of 
our synthesis problem: the reference implementation, the recursion skeleton, and 
the representation function. 

Beyond the decision on what quantifiers should be bounded, the synthe- 
sis algorithm also needs to determine a set of terms that are used to bound 
these quantifiers. We propose an algorithm that discovers these bounds guided 
by counterexamples in a refinement-style loop. We show that this algorithm is 
sound, satisfies the expected weak-progress property that other CEGIS instances 
have, and is parsimonious in a precise sense. We have implemented this algo- 
rithm as a prototype synthesis tool SYNDUCE and demonstrate that SYNDUCE 
can efficiently synthesize recursive functions from specifications. 


2 Background and Notation 


The notation introduced in this section is used for formalizing the result of 
applying recursive functions to symbolic inputs. 


Terms. We make use of a set of symbols that are partitioned into terminal 
symbols $\Sigma$, non-terminal symbols $N$, and an infinite set of typed variables $V$. 
There is a unique symbol $\circ_?$ that stands for a hole. Terms are defined by the 
grammar $T \to x \mid T\ T$ where x is a symbol, and T T is a function application. 
These are the relevant classes of terms: 


- Concrete terms $T(\Sigma)$ are those containing only terminal symbols. Every 
concrete term can be interpreted and has a concrete value. 

- Symbolic terms $T(\Sigma, V)$ are those containing terminal symbols or variables. 

- Closed terms $T(\Sigma, N)$ are those containing terminal or non-terminal symbols, 
but no variables. 

- Applicative terms $T(\Sigma, N, V)$ are those containing any symbol except the hole 
symbol. 

- Contexts $T(\Sigma, N, V, \circ_?)$ are those with at least one hole. A one-hole context 
C[ ] is a context with a single occurrence of $\circ_?$, and C[t] stands for the term 
formed by replacing the single hole in C[ ] with the term t. 


Two terms are equal, denoted by $t =_\alpha t'$ (standard alpha conversion), iff there 
exist two injective substitutions $\sigma : FV(t) \to V \setminus (FV(t) \cup FV(t'))$ and 
$\sigma' : FV(t') \to V \setminus (FV(t) \cup FV(t'))$ such that $\sigma t = \sigma' t'$ (i.e. syntactically equal). 

A symbolic term t can be expanded into a term t' iff there exists a 
substitution $\sigma : FV(t) \to T(FV(t') \cup \Sigma)$ that substitutes the free variables of t for 
symbolic terms with the free variables of t' such that $t' = \sigma t$. The relation $\succeq$ 
over symbolic terms is a partial order defined as $t \succeq t'$ iff t can be expanded 
into t'. A single variable is the maximal element according to this partial order 
and concrete terms (of any depth) are minimal elements. 
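
The notions of symbolic term, free variables, and expansion by substitution can be illustrated with a small OCaml sketch. The type and function names below (term, free_vars, subst) are our own, non-terminals and holes are omitted, and substitutions are represented as association lists; this is only one possible encoding.

```ocaml
(* Symbolic terms over terminal symbols and variables. *)
type term =
  | Sym of string        (* terminal symbol *)
  | Var of string        (* variable *)
  | App of term * term   (* function application T T *)

(* Free variables of a term, without duplicates, in left-to-right order. *)
let rec free_vars t =
  match t with
  | Sym _ -> []
  | Var x -> [ x ]
  | App (f, a) ->
    let fa = free_vars a in
    List.filter (fun x -> not (List.mem x fa)) (free_vars f) @ fa

(* Apply a substitution given as an association list from variable names
   to terms; variables not mentioned in the list are left unchanged. *)
let rec subst sigma t =
  match t with
  | Sym _ -> t
  | Var x -> (match List.assoc_opt x sigma with Some u -> u | None -> t)
  | App (f, a) -> App (subst sigma f, subst sigma a)

(* The symbolic term Node a l r ... *)
let t = App (App (App (Sym "Node", Var "a"), Var "l"), Var "r")

(* ... expands into Node a Nil r by substituting Nil for l. *)
let t' = subst [ ("l", Sym "Nil") ] t
```

Here t can be expanded into t', so t is above t' in the partial order; t' has strictly fewer free variables but is not yet a concrete term.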


Recursive Functions. This paper focuses on recursive functions $f : \tau \to D$ 
with terms of a recursive type ($\tau$ or $\theta$) as input, and an output of type D. These 
functions can be executed on concrete or symbolic input terms of type $\tau$. We 
assume all functions can be translated to recursion schemes as defined below: 


Definition 1 ([26]). A recursion scheme is a tuple $P = (\Sigma, N, R, A)$ where: 


- $\Sigma$ is a ranked alphabet of terminals. 
- $N$ is a finite set of typed non-terminals. 
- $R$ is a finite set of rewrite rules, each in one of the following shapes ($m \geq 0$): 


    (pure)              $F\ x_1 \dots x_m \to t$ 
    (pattern matching)  $F\ x_1 \dots x_m\ p \to t$ 


where the $x_i$ are variables, p is a symbolic term, t is an applicative term in 
$T(\Sigma \cup N \cup \{x_1, \dots, x_n\})$, and F is a non-terminal. 

- $A : \tau \to D$ is a distinguished non-terminal symbol whose defining rules are 
always pattern-matching rules. 


We associate with each recursion scheme P a notion of reduction. A redex is 
an applicative term of the form $F\ \sigma x_1 \dots \sigma x_m\ \sigma p$ for a substitution 
$\sigma : V \to T(\Sigma, N, V)$ and rule $F\ x_1 \dots x_m\ p \to t$ in R. The contractum of the redex 
is $\sigma t$. The one-step reduction relation $\Rightarrow_P\ \subseteq T(\Sigma, N, V) \times T(\Sigma, N, V)$ is defined 
by $C[s] \Rightarrow_P C[t]$ whenever s is a redex, t is its contractum, and C[ ] is a one-hole 
context. A recursion scheme is deterministic iff for any redex $F\ s_1 \dots s_m$ there 
is exactly one rule $l \to r$ (in R) which matches that redex, i.e. there exists a 
substitution $\theta$ such that $F\ s_1 \dots s_m = \theta l$. 

Given a recursion scheme $P = (\Sigma, N, R, A)$ and a term $s \in T(\Sigma, N, V)$, 
$L(P, s)$ denotes the language of $(\Sigma \cup N \cup FV(s))$-labelled trees resulting from the 
maximal rewriting of the term s with the one-step reduction relation associated 
to P. If P is deterministic, then $L(P, s)$ is a singleton (the term s reduces to only 
one possible term), and $[s]_P$ denotes the unique resulting term. This notion of 
reduction is slightly different from the one used in [26], in that we do not require 
the substitution to be closed. 


Symbolic Evaluation. For any function f that can be defined as a recursion 
scheme, the symbolic evaluation of f on input s is simply [s]. In other words, 
f(s) = [s]. In this view, recursive functions and the corresponding recursion 
schemes are interchangeable. For a recursion scheme $(\Sigma, N, R, A)$ representing a 
function f and a variable x, f(x) and A x become two different ways of 
referencing the same concept. In this paper, we assume all recursion schemes to be 
deterministic total functions. Specifically, they terminate on all inputs; symbolic 
evaluation (or the equivalent reduction) of a symbolic term always terminates. 


Types Notation. We use capital letters A, B, C, and D to refer to base types, 
which are scalar types (int, bool, char, ...) or unlabeled products of scalar types 
(e.g. int × int). Our focus is on functions that take as input elements of recursive 
variant (or sum) types denoted by $\tau, \theta, \dots$. We denote by $\kappa_1, \dots, \kappa_n$ the 
constructors of a variant type $\tau$ with n variants. Each constructor is assimilated to 
a terminal symbol $\tau_1 \times \dots \times \tau_k \to \tau$, where $k \geq 0$. We assume that all recursive 
types define finite structures, that is, one can always construct a term of type 
$\tau$ with a finite number of constructors and elements of base type. $x : \tau$ denotes 
the judgement x is of type $\tau$, and $\forall x : \tau$ denotes the universal quantification of 
all variables x of type $\tau$. 

In this setting, where we distinguish base types and recursive types, we 
differentiate bounded terms, which are symbolic terms where all free variables 
are of base type (in $V_B$), and unbounded terms where some variables can be 
of recursive type. An unbounded term t is a symbolic term of finite size, but 
there are infinitely many bounded terms that are expansions of t. 


3 Formal Definition of the Synthesis Problem 


The synthesis problem solved in this paper is defined by three components: a 
reference recursive function $f : \tau \to D$, a representation function $r : \theta \to \tau$ 
that maps inputs of the target function to those of f, and a recursion skeleton 
for the target function. All three components are formally modelled by recursion 
schemes (Definition 1). f and r are standard recursive functions representable by 
deterministic recursion schemes. The recursion scheme for the recursion skeleton 
$S[\Xi] : \theta \to D$ includes a special set $\Xi$ of symbols as a subset of its terminal 
symbols, which correspond to the unknown components for synthesis. These 
unknowns stand for constants or functions that have to be synthesized. 

At the high level, the solution to the synthesis problem is the definition of a 
new recursive function. At the low level, each of the unknowns in $\Xi$ needs to be 
given a definition. In each problem instance, it is assumed that f and $S[\Xi]$ use a 
common set of terminal symbols $\Sigma$ that belong to a background theory T (e.g. 
linear integer arithmetic). Formally, the solution is identified by a mapping Z 
from the unknowns $\Xi$ to function definitions $\lambda x_1. \dots \lambda x_n.\ t$ where $n \geq 0$ and t is 
a symbolic term in $T(\Sigma, \{x_1, \dots, x_n\})$ (a concrete term if n = 0). Let $S[\Xi/Z]$ be 
the recursion scheme obtained by replacing the unknowns $\Xi$ by their definition 
in Z. Any solution Z that satisfies the following specification is a valid solution: 


    $\Psi \equiv \forall x : \theta.\ S[\Xi/Z](x) = (f \circ r)(x)$ 


Example 1. We use a problem instance with the goal of synthesizing a recursive 
function on tree paths as a running example of this paper. Recall the mips 
function given in Fig. 1. Suppose that we want to transform it to a function on tree 
paths¹ as an alternative data type to labelled binary trees. For an A-labelled tree 
(of type Tree), Path is a datatype defined by the following grammar: 


    Path → Top | Zip((T|L), A, Tree, Path) 

Intuitively, a path decomposes a tree. 
[Figure: a tree decomposed along a path from the root to a leaf] 


The path $Zip(T, a, t_a, Zip(L, b, t_b, Zip(T, c, t_c, x)))$, from the root 
to a leaf, decomposes the tree into the subtrees $t_a$, $t_b$, and $t_c$. 


The synthesis problem is specified by three recursion schemes. The recursion 
scheme f, shown below, models the function mips from Fig. 1. $A_f$ is 
the non-terminal corresponding to the main function mips and G is an auxiliary 
function. An additional non-terminal L is used to mirror the tuple decomposition 
done by the let-binding in the code of mips. 

    f :  $A_f\ t \to G\ (0, 0)\ t$ 
         $G\ s\ Nil \to s$ 
         $G\ s\ Node(a, l, r) \to G\ (L\ a\ (G\ s\ l))\ r$ 
         $L\ a\ (sum, mps) \to (sum + a,\ \max\ (sum + a)\ mps)$ 

The second recursion scheme is the representation function r from 
paths to trees. The input path is recursively decomposed by the 
rewrite rules, and Node is constructed recursively on the right or on the left 
depending on the first value contained in the Zip constructor. 

    r :  $A_r\ Top \to Nil$ 
         $A_r\ Zip(T, a, t, z) \to Node(a, t, A_r\ z)$ 
         $A_r\ Zip(L, a, t, z) \to Node(a, A_r\ z, t)$ 

The last recursion scheme specifies the recursion skeleton of the target 
function with unknowns $s_0$, $g_l$ and $g_r$. 

    $S[s_0, g_l, g_r]$ :  $A_s\ Top \to s_0$ 
                        $A_s\ Zip(T, a, t, z) \to g_l\ a\ (A_f\ t)\ (A_s\ z)$ 
                        $A_s\ Zip(L, a, t, z) \to g_r\ a\ (A_f\ t)\ (A_s\ z)$ 

It traverses the input path, making recursive calls ($A_s\ z$) 
on paths, and calling the reference function on subtrees ($A_f\ t$). The goal is then 
to synthesize implementations of $s_0$, $g_l$ and $g_r$ such that $S[s_0, g_l, g_r]$ is equivalent 
to $f \circ r$. 


1 This example is from [24], which calls this data type zipper. 
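
The three recursion schemes of Example 1 can be transcribed almost literally into executable form. The following is a minimal Python sketch for illustration only (it is not SYNDUCE code). The class names, the string tags "T"/"L" for the first component of Zip, and the enumeration helpers are assumptions of this sketch. The candidate s0, gl, gr at the end is derived by hand for this illustration, not taken from the tool's output, and the final check only enumerates small paths, so it is a bounded test rather than a proof.

```python
from dataclasses import dataclass
from itertools import product
from typing import Union


@dataclass(frozen=True)
class Nil:
    pass


@dataclass(frozen=True)
class Node:
    a: int
    l: "Tree"
    r: "Tree"


@dataclass(frozen=True)
class Top:
    pass


@dataclass(frozen=True)
class Zip:
    side: str  # "T" or "L": the first component of the Zip constructor
    a: int
    t: "Tree"
    z: "Path"


Tree = Union[Nil, Node]
Path = Union[Top, Zip]


def L(a, sm):
    """L a (s, m) -> (s + a, max(s + a, m))."""
    s, m = sm
    return (s + a, max(s + a, m))


def G(s, t):
    """G s Nil -> s ; G s Node(a, l, r) -> G (L a (G s l)) r."""
    if isinstance(t, Nil):
        return s
    return G(L(t.a, G(s, t.l)), t.r)


def f(t):
    """Reference function: A_f t -> G (0, 0) t."""
    return G((0, 0), t)


def r(z):
    """Representation function from paths to trees."""
    if isinstance(z, Top):
        return Nil()
    if z.side == "T":
        return Node(z.a, z.t, r(z.z))
    return Node(z.a, r(z.z), z.t)


def S(s0, gl, gr, z):
    """Recursion skeleton S[s0, gl, gr] applied to the path z."""
    if isinstance(z, Top):
        return s0
    g = gl if z.side == "T" else gr
    return g(z.a, f(z.t), S(s0, gl, gr, z.z))


# Hand-derived candidate for the unknowns (an assumption of this sketch).
s0 = (0, 0)


def gl(a, sm1, sm2):
    (s1, m1), (s2, m2) = sm1, sm2
    return (s1 + a + s2, max(m1, s1 + a + m2))


def gr(a, sm1, sm2):
    (s1, m1), (s2, m2) = sm1, sm2
    return (s2 + a + s1, max(m2, s2 + a + m1))


def trees(depth, labels):
    """All trees of depth at most `depth` with labels drawn from `labels`."""
    yield Nil()
    if depth > 0:
        subs = list(trees(depth - 1, labels))
        for a, left, right in product(labels, subs, subs):
            yield Node(a, left, right)


def paths(length, depth, labels):
    """All paths with at most `length` Zip constructors."""
    yield Top()
    if length > 0:
        ts = list(trees(depth, labels))
        for z in list(paths(length - 1, depth, labels)):
            for side, a, t in product("TL", labels, ts):
                yield Zip(side, a, t, z)


def bounded_check(length=2, depth=2, labels=(-2, 1)):
    """Check S[s0, gl, gr](z) = (f o r)(z) on all enumerated paths."""
    return all(S(s0, gl, gr, z) == f(r(z)) for z in paths(length, depth, labels))
```

Calling bounded_check() compares the skeleton against f ∘ r on every enumerated path; a False result would indicate a counterexample within the chosen bounds.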
4 Recursion-Free Approximations


A system of recursion-free equations models an approximation of the full
functional specification Ψ for a recursive synthesis problem instance.


Definition 2. Given two sets of terminals Σ and Ξ, a system of recursion-free
equations is a finite set of constraints {e_i = e'_i} where e_i, e'_i ∈ T(Σ ∪ Ξ, V_Σ).


We denote by {e_i = e'_i}_{i∈I} the set of constraints of the system, and
{x_j}_{1≤j≤n} = ⋃_{i∈I} FV(e_i) ∪ FV(e'_i) are the free variables in the system.
The above system defines a synthesis problem where Σ is the signature of some
theory T and Ξ is the set of unknowns to be synthesized. A solution Z to this
synthesis problem is a mapping from Ξ to function definitions. Z is valid iff
the following formula is valid:

\[ \forall x_1 : D_1 \ldots \forall x_n : D_n.\ \bigwedge_{i \in I} (e_i = e'_i)[\Xi/Z] \]

where (e_i = e'_i)[Ξ/Z] denotes the term in which the unknowns Ξ have been
replaced by their definition in Z. In the rest of the paper, we consider systems of
recursion-free equations where the set of terminals Σ and the set of unknowns Ξ
are fixed and the same as in the main synthesis problem of Sect. 3. We say that
a system ℰ' is a sound approximation of a system ℰ (ℰ' ≽ ℰ) (or the synthesis
problem Ψ) when any solution of ℰ (or Ψ) is also a solution of ℰ'.


4.1 Partially Bounded Quantification 


Consider the formal definition of the synthesis problem in Sect. 3. Bounding the 
quantifiers consists in expressing the problem on a finite set of bounded terms. 
This bounding effectively eliminates recursion; recursive calls can be inlined a 
bounded number of times. Yet, since the free variables of the bounded term are 
universally quantified over an infinite base domain, a bounded term t of type θ
represents an infinite set of concrete inputs (of bounded size). 

We propose a different strategy for bounding the quantifiers: we aim to 
instantiate the quantifier on a finite set of bounded and unbounded terms such 
that the resulting specification is not recursive. To start, we instantiate the uni- 
versal quantifier by a finite set of arbitrary symbolic terms T. Our first approx- 
imation then becomes the set of constraints: 


\[ E(T) = \{\, S[\Xi](t) = (f \circ r)(t) \mid t \in T \,\} \tag{1} \]


The set of constraints E(T) can be seen as a synthesis problem where free
variables are universally quantified and the unknowns in Ξ are to be synthesized.
E(T) is not guaranteed to be a system of recursion-free equations for all choices
of T. For an arbitrary symbolic term t, calls to recursive functions may appear in
subterms of S[Ξ](t) and (f ∘ r)(t). Restricting T to bounded terms would yield
a recursion-free system after symbolic evaluation of both sides of the equation.

This, however, is too restrictive. There may exist unbounded terms t where
the equation S[Ξ](t) = (f ∘ r)(t) can be rewritten to an equivalent
recursion-free equation. Intuitively, in an applicative term (resulting from the
symbolic evaluation of a recursive function f) the simple subterms of the form
f(x) where x is a variable can be eliminated by replacing f(x) with a single
variable a of type D which now stands for the result of the invocation of f on
any x.


Definition 3. A symbolic term t is maximally reducible (t is an MR-term) by a
recursion scheme P = (Σ, N, R, A) iff [t]_P is an applicative term in T(Σ, N, V)
such that replacing all subterms of the form (A x) (where x ∈ V) by a fresh
variable x' ∉ FV(t) yields a symbolic term.


Example 2. The term z = Zip(T, a, t, Top) where a is an integer and t is of type
Tree is maximally reducible by f ∘ r and S[s_0, g_l, g_r] (cf. Example 1). First we
have r(z) = [z]_r = Node(a, t, Nil) and (f ∘ r)(z) = G (L a (A_f t)) Nil. If A_f t
is replaced by (a_1, a_2) (of type int × int), then the term can be reduced further
to (a_1 + a, max(a_1 + a, a_2)). For the other function, we have
S[s_0, g_l, g_r](z) = g_l a (A_f t) s_0. If A_f t is also replaced by (a_1, a_2),
then the term reduces to the symbolic term g_l a (a_1, a_2) s_0. Note that z is an
unbounded term, since t is a variable representing a tree of arbitrary depth.


If every term in T is maximally reducible by both (f ∘ r) and S[Ξ], then
every call to a recursive function can be eliminated in E(T). Note that this 
new sufficient condition for E(T) to be recursion free is strictly weaker than the 
condition of having the terms in T to be bounded; a maximally reducible term 
need not be a bounded term. 


Definition 4. A set of constraints E(T) = {S[Ξ](t) = (f ∘ r)(t) | t ∈ T} is
well-formed iff every t ∈ T is maximally reducible by f ∘ r and S[Ξ].


A well-formed set of constraints E(T) can be transformed to a system of
recursion-free equations. For each free variable x : θ in E(T), a fresh variable
a : D is added and the subterms (f ∘ r)(x) and S[Ξ](x) are replaced by a in
every constraint. We call this rewriting step recursion elimination over D. Note
that the calls to f ∘ r and S[Ξ] are both replaced by the same variable, since
their equivalence is part of the specification of the synthesis problem.

The transformation described above produces a recursion-free system of
equations, but it does not always yield a sound abstraction, specifically when
f ∘ r is not onto D. There may exist a solution of Ψ that is not a solution of the
resulting system of equations. This can be fixed by having additional constraints
(invariants) on the fresh variables. Let Im_f : D → bool be a predicate such that
f ∘ r is onto {c | c : D ∧ Im_f(c)}. Then, the abstraction is sound if the choices
for a : D are limited to values for which Im_f(a) holds.


Example 3. Recall Example 1. The maximum in-order prefix sum is not onto
int × int, since the second element of the pair is always a positive integer. The
constraint Im_f(x, y) = y > 0 is required to make the function onto. In Example 2,
a_2 must be a positive integer.
Definition 5. Let T be a set of maximally reducible terms by f ∘ r and S[Ξ],
and Im_f a predicate such that f ∘ r is onto {c | c : D ∧ Im_f(c)}. We denote
by ℰ(T) the equation system obtained by rewriting each constraint in E(T) to a
recursion-free equation, through recursion elimination over {c | c : D ∧ Im_f(c)}.


In the synthesis problem defined by ℰ(T), the variables introduced by recursion
elimination are universally quantified over their restricted range. The exact
encoding of the range restriction by Im_f depends on the implementation of a
synthesis oracle.


Proposition 1. Z is a solution of ℰ(T) iff Z is a solution of E(T).


The proof follows from the construction of ℰ(T) based on E(T). Combining this
with the fact that E(T) results from bounding the universal quantifications in
Ψ, we can conclude that ℰ(T) approximates Ψ.


Theorem 1 (Sound approximation). If T is a set of maximally reducible
terms by f ∘ r and S[Ξ], ℰ(T) is a sound approximation of Ψ.


By construction, any solution of the functional specification Ψ is a solution
of the system of equations ℰ(T).


Example 4. Let T = {Top, Zip(T, a, t, Top), Zip(L, a, Nil, z)} be a set of terms,
where a : int, t : Tree and z : Path. Top is a concrete term, therefore maximally
reducible. We saw in Example 2 that Zip(T, a, t, Top) is an MR-term. With a
similar reasoning, one can conclude that Zip(L, a, Nil, z) is an MR-term; note
how the term differs in which subterm is unbounded depending on the first
component of the Zip. Therefore, E(T) is a well-formed set of constraints and by
replacing A_f t and A_s z by (a_1, a_2) (where a_1 : int and a_2 ∈ {v : int | v > 0}),
we obtain the following recursion-free system of equations:

\[
\mathcal{E}(T) = \left\{
\begin{array}{rcl}
(0, 0) & = & s_0 \\
(a_1 + a,\ \max(a_1 + a,\, a_2)) & = & g_l\; a\; (a_1, a_2)\; s_0 \\
(a_1 + a,\ \max(a_1 + a,\, a_2)) & = & g_r\; a\; s_0\; (a_1, a_2)
\end{array}
\right.
\]

with free variables a : int, a_1 : int and a_2 ∈ {v : int | v > 0}.


In contrast to a canonical CEGIS setting, where the approximation is the
specification projected over a finite set of concrete terms, our abstraction is
over an infinite set of concrete terms represented by a finite set of symbolic
terms. In the original functional specification, the equational constraint
(f ∘ r)(x) = S[Ξ](x) ranges over all possible terms x of type θ. In the abstraction
ℰ(T), the universally quantified variables are the free variables of the terms in
the equations, which correspond to the variable symbols of scalar type used in
the symbolic terms of T, modulo the introduction of fresh variables during the
rewriting of the set of constraints E(T) to the system of equations ℰ(T).


4.2 Refining Systems of Equations 


Our approximation, the system of equations ℰ(T), is parametric on a set of
maximally reducible terms T. This approximation can be refined by adding terms
to T, since for any two sets of terms R and T such that R ⊆ T, ℰ(R) ≽ ℰ(T).

The convergence of the refinement process depends on the terms added at 
each step. We present our refinement algorithm in the next section, but the 
main insights behind it, not tied to specific algorithmic choices, are captured by 
Propositions 2 and 3. 


Proposition 2. Let T be a set of MR-terms and Z be a solution of ℰ(T). Then
for any term t' such that there exists t ∈ T s.t. t ≽ t', Z is a solution of
ℰ(T ∪ {t'}).


This proposition implies that if Z is a spurious solution, then a counterexample
term showing that it is not a solution of Ψ is necessarily not expanded
from a term in T. We also learn that T should ideally be an antichain of ≽ at
every refinement round, since adding expanded terms does not strengthen the
approximation.


Proposition 3. Given two terms t and t' such that t ≽ t' (i.e. t' is an expansion
of t) and a set of MR-terms T such that ∀x ∈ T, ¬(x ≽ t ∧ t ≽ x), we have
ℰ(T ∪ {t}) ≼ ℰ(T ∪ {t'}).


Adding the less expanded term (i.e. t) yields both a more general approximation 
and a more compact one. In other words, given a choice, always choose the least 
expanded term as the counterexample for refinement. 


5 Synthesis Algorithm 


Our synthesis algorithm computes a sequence of approximations of the functional
specification Ψ from Sect. 3. Each approximation is a system of equations of the
form ℰ(T) (Definition 5). The approximations are incrementally refined until
the synthesis solution for one is also a valid solution for the synthesis problem
specified by Ψ.
Figure 2 illustrates the workflow of our algorithm. At the beginning of each
iteration, a solution of the system of recursion-free equations ℰ(T) is
synthesized. If no solution is found, then there is no solution for the original
synthesis problem, since ℰ(T) is guaranteed to be a sound approximation
(Theorem 1). If a solution Z is found, then Z is verified against Ψ and if
it passes, then it is returned as a solution. Otherwise, the verifier returns a
counterexample term x_C. By Proposition 2, x_C cannot be an expansion of any
term in T, and new terms related to x_C have to be added to T in the spirit of
refinement.

The algorithm additionally keeps track of a set U of non-maximally reducible
terms, which intuitively represents the set of inputs not covered by the current
approximation. The sets T and U are complementary in a precise sense: T ∪ U is
always a boundary of ≽. A boundary (of a partial order) is an antichain C such
that for any bounded term t, there is some c in C such that c ≽ t.

The counterexample x_C is necessarily an expansion of some term u_C ∈ U.
But since u_C is by definition not maximally reducible, one cannot just remove
it from U and add it to T. The Expand step takes u_C as an input and produces
two sets T' and U' to update the current sets T and U and repair the boundary
before the loop restarts.

[Figure: graphical representation of the boundary repair.]

The sets T (in blue) and U (in red) in the figure initially form a boundary.
This boundary is updated by removing the term u_C and adding U' and T' (the
results of the Expand step) to form a new boundary. The fact that T ∪ U
always forms a boundary is a required invariant of this refinement loop: (i) T,
as a parameter of ℰ(T), is required to be an antichain (as discussed
in Sect. 4.2), and (ii) the Generalize step relies on the assumption that U is an
antichain containing all the terms not yet sufficiently expanded to be in T.

We rely on existing tools/techniques for the steps Synthesize and Verify of 
Fig. 2. In the following, we describe the Initialize, Expand, and Generalize steps of 
the algorithm. 


Fig. 2. Approximation refinement algorithm. 
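
The control flow of Fig. 2 can be summarized as a short loop. The following Python sketch is illustrative only: the five callbacks stand for the Initialize, Synthesize, Verify, Generalize and Expand steps, and their names, signatures, the max_rounds budget, and the integer-based toy oracles are assumptions of this sketch, not SYNDUCE's interface.

```python
def refine(initialize, synthesize, verify, generalize, expand, max_rounds=50):
    """Counterexample-guided refinement loop in the shape of Fig. 2.

    synthesize(T) returns a solution of the approximation built from T,
    or None if none exists; verify(Z) returns None if Z is valid, or a
    counterexample term otherwise.
    """
    T, U = initialize()
    for _ in range(max_rounds):
        Z = synthesize(T)
        if Z is None:
            # The approximation is sound, so the original problem is unsolvable.
            return None
        x_c = verify(Z)
        if x_c is None:
            return Z
        u_c = generalize(x_c, U)       # the term of U that x_c expands
        T_new, U_new = expand(u_c)     # repair the boundary around u_c
        T = T | set(T_new)
        U = (U - {u_c}) | set(U_new)
    raise RuntimeError("round budget exhausted")


def toy_run():
    """Toy oracles over integers, only to exercise the control flow."""
    return refine(
        initialize=lambda: ({0}, {1}),
        synthesize=lambda T: max(T),
        verify=lambda Z: None if Z >= 3 else Z + 1,
        generalize=lambda x_c, U: next(u for u in U if u == x_c),
        expand=lambda u: ({u}, {u + 1}),
    )
```

In the toy run, each round moves one element from U to T until the verifier accepts; in the actual algorithm the same loop operates on symbolic terms.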




Initialization. There is a straightforward way to initialize T and U: apply the
Expand component to a single variable x of type θ and take the resulting sets
T of maximally reducible terms and U of non-maximally reducible terms. The
Expand step is described in the next section. For Example 1, a variable x of type
Path is expanded to produce T = {Top} and U = {Zip(L, a, t, z), Zip(T, a, t, z)}
with variables a, t, and z of the appropriate types.


5.1 Expand : Producing Maximally Reducible Terms 


Given an input term u_C, Expand generates two sets T' and U' such that the terms
in T' are maximally reducible by both f ∘ r and S[Ξ]. The computation of these
terms is done by expanding the input term u_C until a set of maximally reducible
terms is found. The following algorithm illustrates the process:

    T' = ∅, U' = {u_C};
    while T' = ∅ do
        Pick u_0 in U';
        S = ExpandOnce(u_0);
        T', U'' = Partition(S);
        U' = (U' \ {u_0}) ∪ U'';
    end
    return T', U'

At each step, a term u_0 is picked from the set of non-maximally reducible
terms U'. This term is expanded once, by a call to ExpandOnce (which is
described later). The resulting set of terms is then partitioned into a set of
maximally reducible terms T' and a set of non-maximally reducible terms U'';
the latter is used to update U'.

The choice of u_0 at the first line of the loop is important for the termination
of the algorithm. There may be an infinite sequence of expansions if the u_0's are
adversarially chosen. There always exists a finite sequence of expansions yielding
bounded terms which are by definition maximally reducible. A breadth-first
exploration of all expansions is one such strategy that ensures the termination
of the algorithm.
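
As an illustration of the Expand loop with a breadth-first choice of u_0, consider the following minimal Python sketch. The functions expand_once and is_mr are passed in as parameters; here they are replaced by hand-written lookup tables (EXPANSIONS and MR_TERMS, both assumptions of this sketch) that record the expansions and MR-terms stated in the Initialization paragraph and in Example 5, with terms written as plain strings.

```python
from collections import deque


def expand(u_c, expand_once, is_mr):
    """Expand u_c until at least one maximally reducible term is produced.

    U' is kept as a FIFO queue, so terms are picked in breadth-first order.
    """
    t_new, u_new = [], deque([u_c])
    while not t_new and u_new:
        u0 = u_new.popleft()                 # Pick u_0 in U' and remove it
        s = list(expand_once(u0))            # S = ExpandOnce(u_0)
        t_new = [t for t in s if is_mr(t)]   # T', U'' = Partition(S)
        u_new.extend(t for t in s if not is_mr(t))
    return t_new, list(u_new)


# Hand-written stand-ins for ExpandOnce and the MR check (illustration only).
EXPANSIONS = {
    "x": ["Top", "Zip(T,a,t,z)", "Zip(L,a,t,z)"],
    "Zip(T,a,t,z)": [
        "Zip(T,a,t,Top)",
        "Zip(T,a,t,Zip(T,a',t',z'))",
        "Zip(T,a,t,Zip(L,a',t',z'))",
    ],
    "Zip(L,a,t,z)": ["Zip(L,a,Nil,z)", "Zip(L,a,Node(a',l,r),z)"],
}
MR_TERMS = {"Top", "Zip(T,a,t,Top)", "Zip(L,a,Nil,z)"}


def toy_expand(term):
    return expand(term, EXPANSIONS.__getitem__, MR_TERMS.__contains__)
```

With these tables, toy_expand("x") mirrors the initialization step and toy_expand("Zip(T,a,t,z)") mirrors Example 5.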


ExpandOnce. The input of ExpandOnce is a term u_0 that is not maximally
reducible. The following proposition characterizes u_0 and the reason for its
non-reducibility:


Proposition 4. Let u_0 ∈ T(Σ, V) and g = (Σ, N, R, A) a recursion scheme. u_0
is not maximally reducible by g iff there exists a subterm of [u_0]_g of the form
s = F t_1 ... t_n x, where F ∈ N and F ≠ A, the terms t_1 ... t_n are applicative
terms, and x ∈ FV(u_0).


The proof by cases on the subterms of u_0 is given in the extended version of
this paper [7]. In order to take a step towards making u_0 maximally reducible,
the variable x needs to be expanded. Expanding x into a term guarantees some
rule F x_1 ... x_n p → t ∈ R can be used to reduce u_0 further. Such a rule is
guaranteed to exist for a recursion scheme representing a total function.

Next, we define how u_0 is expanded at a variable x identified by Proposition 4.
u_0 can be written as C[x] for some one-hole context C[ ]. Assume the type ρ
of x has constructors κ_1, ..., κ_n where each κ_i has type γ_i → ρ. The pointwise
expansion of u_0 at x is the set of terms {C[κ_1(x_1)], ..., C[κ_n(x_n)]} where each
x_i is a variable (or a tuple of variables) of type γ_i.
In summary, ExpandOnce first identifies a variable x in u_0 (Proposition 4)
that needs to be expanded and then performs the pointwise expansion of uo at 
x and returns the resulting set of terms. 

One important feature of ExpandOnce is that terms are expanded only where 
needed. Proposition 4 identifies the precise location (i.e. x) where expanding is 
necessary and ignores locations where it is not. 
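
The pointwise expansion itself is a simple substitution. The sketch below, in Python, assumes a term encoding chosen for this illustration: a variable is a Var(name, type) and a constructor application is a tuple whose first element is the constructor name. The CONSTRUCTORS table and the fresh-name scheme v0, v1, ... are also assumptions; the identification of the variable to expand (Proposition 4) is taken as given.

```python
from typing import NamedTuple


class Var(NamedTuple):
    name: str
    type: str


TYPES = {"int", "Tree", "Path"}

# Constructor templates: literal tags ("T", "L") are kept as they are,
# type names are replaced by fresh variables of that type.
CONSTRUCTORS = {
    "Path": [
        ("Top",),
        ("Zip", "T", "int", "Tree", "Path"),
        ("Zip", "L", "int", "Tree", "Path"),
    ],
    "Tree": [("Nil",), ("Node", "int", "Tree", "Tree")],
}


def var_names(term):
    """Names of all variables occurring in a term."""
    if isinstance(term, Var):
        return {term.name}
    if isinstance(term, tuple):
        return set().union(*(var_names(arg) for arg in term[1:]))
    return set()


def fresh_var(type_, taken):
    """A variable named v0, v1, ... that does not clash with `taken`."""
    k = 0
    while f"v{k}" in taken:
        k += 1
    taken.add(f"v{k}")
    return Var(f"v{k}", type_)


def instantiate(template, taken):
    """Build kappa_i(x_i) with fresh variables for the constructor fields."""
    args = [fresh_var(fld, taken) if fld in TYPES else fld for fld in template[1:]]
    return (template[0],) + tuple(args)


def substitute(term, x, replacement):
    """Replace every occurrence of the variable x in term."""
    if isinstance(term, Var):
        return replacement if term == x else term
    if isinstance(term, tuple):
        return (term[0],) + tuple(substitute(a, x, replacement) for a in term[1:])
    return term


def pointwise_expand(term, x):
    """The set {C[kappa_1(x_1)], ..., C[kappa_n(x_n)]} as a list."""
    out = []
    for template in CONSTRUCTORS[x.type]:
        taken = var_names(term)
        out.append(substitute(term, x, instantiate(template, taken)))
    return out
```

Applied to Zip(T, a, t, z) at z, this produces three terms with the same shape as u_1, u_2 and u_3 of Example 5, up to the names of the fresh variables.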


Example 5. Recall Example 1. Suppose u_0 = Zip(T, a, t, z) is a (symbolic) path
and an input to ExpandOnce, where a is an integer, t is of type Tree, and z
is of type Path. u_0 is not maximally reducible and has to be expanded. Note
that r(u_0) = Node(a, t, A_r z) and therefore (f ∘ r)(u_0) = G (L a (A_f t)) (A_r z).
The subterm (A_r z) blocks the reduction of the term starting with G, because
z blocks the reduction of A_r z and therefore, u_0 has to be expanded at z. The
pointwise expansion of u_0 at z yields the terms u_1 = Zip(T, a, t, Top),
u_2 = Zip(T, a, t, Zip(T, a', t', z')), and u_3 = Zip(T, a, t, Zip(L, a', t', z')).
Note that the tree element t need not be expanded; we showed in Example 2 that
u_1 is maximally reducible and therefore, the expansion loop stops and returns
T' = {u_1} and U' = {u_2, u_3}.

Consider the symmetric term Zip(L, a, t, z) acquired by replacing the T in
u_0 with L. The expansion of this term yields T' = {Zip(L, a, Nil, z)} and
U' = {Zip(L, a, Node(a', l, r), z)}. Note that unlike the case for u_0, the tree
element of the path has to be expanded and the path element need not be expanded.


5.2 Counterexample Generalization 


The generalization of the counterexample x_C is the unique term u_C ∈ U such
that u_C ≽ x_C. The term u_C is guaranteed to exist because the algorithm
maintains the invariant that T ∪ U is a boundary, and it is unique since U is
always an antichain.
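
Finding u_C amounts to first-order matching of the counterexample against the terms stored in U. The following Python sketch assumes the same illustrative encoding as before (a variable is a Var, a constructor application is a tuple headed by the constructor name) and assumes that each variable occurs at most once in a stored term and that types can be ignored during matching. The function names are hypothetical.

```python
from typing import NamedTuple


class Var(NamedTuple):
    name: str


def is_instance(pattern, term):
    """True if term is obtained from pattern by instantiating its variables."""
    if isinstance(pattern, Var):
        return True
    if isinstance(pattern, tuple):
        return (
            isinstance(term, tuple)
            and not isinstance(term, Var)
            and len(pattern) == len(term)
            and all(is_instance(p, s) for p, s in zip(pattern, term))
        )
    return pattern == term


def generalize(x_c, U):
    """Return the unique term of U that the counterexample x_c expands."""
    candidates = [u for u in U if is_instance(u, x_c)]
    if len(candidates) != 1:
        raise ValueError("U is expected to contain exactly one generalization")
    return candidates[0]
```

The uniqueness check reflects the antichain invariant on U: if it fails, the boundary invariant has been broken somewhere else in the loop.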


Example 6. After initialization, the synthesis solver attempts to find a solution
for the system of equations given in Example 4. One possible solution is

    s_0 = (0, 0)    g_l(a, (s_1, m_1), (s_2, m_2)) = (a + s_1, max(m_1, a + s_1))

together with a similar solution for g_r. But the solution for g_l is incorrect;
the first component should be a + s_1 + s_2 (i.e. the sum of both partial sums
and the label of the node). The verifier returns a counterexample of the form
x_C = Zip(T, 1, Node(?), Zip(T, -2, Node(?), ?)) where the question marks stand
for concrete subterms of the appropriate type. These subterms are ignored. The
counterexample is generalized by selecting u_2 = Zip(T, a, t, Zip(T, a', t', z'))
(where u_2 ≽ x_C), the term that was stored in U after the expansion described
in Example 5. This determines where the algorithm must unfold the path one
more time to build a stronger approximation.
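
The spurious solution of Example 6 can be checked mechanically. The Python sketch below is an illustration under stated assumptions: trees are encoded as None (Nil) or a triple (a, l, r), paths as None (Top) or a tuple (side, a, t, z); the shape of the "similar solution" for g_r is a guess made for this sketch; the question marks of the counterexample are filled with arbitrary concrete subterms; and the system of Example 4 is tested on a small integer grid rather than proved.

```python
from itertools import product


def L(a, sm):
    s, m = sm
    return (s + a, max(s + a, m))


def G(s, t):
    if t is None:                      # Nil
        return s
    a, left, right = t                 # Node(a, l, r)
    return G(L(a, G(s, left)), right)


def f(t):
    return G((0, 0), t)


def r(z):
    if z is None:                      # Top
        return None
    side, a, t, rest = z
    return (a, t, r(rest)) if side == "T" else (a, r(rest), t)


def S(s0, gl, gr, z):
    if z is None:
        return s0
    side, a, t, rest = z
    g = gl if side == "T" else gr
    return g(a, f(t), S(s0, gl, gr, rest))


S0 = (0, 0)


def gl_spurious(a, sm1, sm2):
    # The solution of Example 6: it ignores its third argument.
    s1, m1 = sm1
    return (a + s1, max(m1, a + s1))


def gr_spurious(a, sm1, sm2):
    # Assumed shape of the "similar solution" for g_r.
    s2, m2 = sm2
    return (a + s2, max(m2, a + s2))


def satisfies_example4(s0, gl, gr, ints=range(-3, 4), restricted=range(1, 5)):
    """Test the three equations of Example 4 on a finite grid of values."""
    if s0 != (0, 0):
        return False
    for a, a1, a2 in product(ints, ints, restricted):
        expected = (a1 + a, max(a1 + a, a2))
        if gl(a, (a1, a2), s0) != expected:
            return False
        if gr(a, s0, (a1, a2)) != expected:
            return False
    return True


# One concrete instance of the counterexample shape of Example 6.
LEAF = (0, None, None)
X_C = ("T", 1, LEAF, ("T", -2, LEAF, None))
```

On X_C the skeleton with the spurious g_l and the reference f ∘ r disagree in the first component, which is the discrepancy described in Example 6, even though the candidate satisfies every equation of the first approximation on the tested grid.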

We report in Sect.7 that SYNDUCE succeeds in finding a solution for this 
example with 3 refinement rounds in 1.57s, whereas the symbolic CEGIS 
(described in Sect. 1) times out after 10 min over 6 refinement rounds. 
Soundness. Under the assumption that the steps Synthesize and Verify are
soundly implemented, the overall algorithm is sound. By construction, T is
always a set of maximally reducible terms. Therefore, ℰ(T) is guaranteed to
be a sound approximation of Ψ by Theorem 1. The soundness of the verification
oracle guarantees that any returned solution is in fact a solution of the synthesis
problem specified by Ψ.


Weak Progress. Consider the naive algorithm that would expand T by simply
adding the counterexample x_C to it; x_C is a maximally reducible term after
all. This naive algorithm satisfies a weak progress property, namely that the
spurious solution Z from any round will not be a solution in any future round.
Our algorithm does something more sophisticated and therefore it has to be
argued that the same weak progress property holds. First, Expand satisfies the
following property that guarantees T ∪ U to always be a boundary:


Proposition 5. Let t be some symbolic term and T', U' be the results of the call
to Expand(t). Then T' ∪ U' is a boundary of the set {t' | t ≽ t'}.


Let u_C be the generalization of x_C. Proposition 5 guarantees that Expand
computes and adds all possible expansions of u_C to T. This in turn implies
that there always exists a term t ≽ x_C in the updated set T (after the call to
Expand), which rules x_C out as a spurious solution in all future rounds. Note
that the algorithm relies on the existence of u_C in U. For this, it requires T ∪ U
to be a boundary.


Parsimony. Finally, we can show that our algorithm is parsimonious with the 
selection of the terms for T in the following precise way: 


Theorem 2 (Parsimony). Let us assume (T, U) is a boundary that our algorithm
reaches in some round, then (T, U) is optimal in the following two senses:


- for every t ∈ T ∪ U there is no MR-term t' such that t' ≻ t.
- there is no non-empty subset T' of T and set U' such that (T \ T') ∪ U' is a
  boundary and ℰ(T \ T') ≼ ℰ(T).


Intuitively, all the terms in T are expanded to the extent necessary and 
no proper subset of T can form a boundary that maintains the same precise 
approximation that T U U induces. The full proof appears in [7]. 


6 Implementation 


Our approach is implemented in SYNDUCE [36], a tool written in OCaml [22], 
and the inputs are recursive functions and datatypes written in Caml. 
6.1 Verification and Synthesis Oracles


SYNDUCE uses bounded model checking to implement Verify from Fig. 2. A
bounded check for the validity of a synthesis solution Z is encoded as the validity
of the formula ⋀_{t∈T} ∀x ∈ FV(t). S[Ξ/Z](t) = (f ∘ r)(t) for a set of bounded
terms T. Z3 [25] is used as the backend SMT solver, which produces a counterexample
in the form of a term for which at least one equality constraint is invalid.

SYNDUCE spends most of its time in the Synthesize box of Fig. 2. Since the 
input to Synthesize is guaranteed to be a recursion-free synthesis specification, 
any off-the-shelf syntax-guided synthesis (SyGuS) [4] solver that supports the 
standard language [29] can be used to implement Synthesize. We use CVC4 [5] 
for the results presented in this section. 

A SyGuS problem is specified by a grammar describing the space of programs
to be synthesized and a set of constraints. In this case, the grammar is generated
from the type of the functions to be synthesized (the unknowns in Ξ), which
can be inferred from the constraints where they appear. Instances of generic
grammars for integers and booleans can be found in the SyGuS language
specification [29], and these grammars for base types can be combined into tuples
in a straightforward manner. The constraints are the equations of the system,
with the addition of the predicates constraining the domain of the variables, i.e.
Im_f from Definition 5. Each recursion-free equation e = e' is translated to a
constraint of the form ¬(⋀_{v ∈ FV(e) ∪ FV(e')} Im_v(v)) ∨ e = e' where Im_v(v) is the
predicate associated to the variable v.
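
The effect of the guarded constraint form can be illustrated by evaluating it directly. In the Python sketch below, the helper names, the dictionary-based valuations, and the example equation are all hypothetical: the equation max(a1 + a, a2) = max(a1 + a, |a2|) is chosen only because it holds on the restricted range of a2 and fails outside of it. The check enumerates a small grid and is not a substitute for an SMT or SyGuS query.

```python
from itertools import product


def constraint_holds(lhs, rhs, invariants, valuation):
    """Evaluate  not(AND_v Im_v(v))  or  lhs = rhs  on one valuation."""
    in_range = all(inv(valuation[v]) for v, inv in invariants.items())
    return (not in_range) or lhs(valuation) == rhs(valuation)


def valid_on_grid(lhs, rhs, invariants, variables, values):
    """Check the guarded constraint on every point of a finite grid."""
    return all(
        constraint_holds(lhs, rhs, invariants, dict(zip(variables, point)))
        for point in product(values, repeat=len(variables))
    )


# Hypothetical equation that is only valid when a2 lies in its restricted range.
def lhs(v):
    return max(v["a1"] + v["a"], v["a2"])


def rhs(v):
    return max(v["a1"] + v["a"], abs(v["a2"]))


VARIABLES = ["a", "a1", "a2"]
GRID = range(-2, 3)
WITH_INVARIANT = {"a2": lambda x: x > 0}
WITHOUT_INVARIANT = {}
```

With the invariant on a2 the constraint is satisfied on the whole grid, because valuations outside the range make the guard vacuously true; without it, the valuation a = 0, a1 = 0, a2 = -1 falsifies the equation.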


6.2 Baseline Method 


The goal of our experimentation is to evaluate the efficiency and efficacy of the 
proposed partial quantifier bounding approach for synthesis of recursive pro- 
grams. Since there is no available (automated) tool that solves the specific prob- 
lem posed in this paper, we implemented the symbolic CEGIS technique (as 
outlined in Sect. 1) to serve as a baseline. To be precise, the algorithm of Fig. 2 
is modified by removing the Generalize and Expand steps; the symbolic coun- 
terexample returned by the verification at each step is added directly to the set 
of terms instead of being generalized. The set T is also initialized as a set of 
bounded terms of some minimal depth, depending on the particular definition 
of the data type. Note that since the baseline method is counterexample-guided, 
it is better than the more straightforward finitization techniques, for example, 
manual finitization by a preset bound. 

We also implemented the concrete CEGIS method (outlined in Sect. 1) to
confirm that the symbolic CEGIS is the better choice. Symbolic CEGIS solves 6
more benchmarks than concrete CEGIS, and does better time-wise in the vast
majority of the rest. Detailed results are given in the extended version of this
paper [7].
6.3 Optimizations


We implemented a few simple, straightforward and generic (i.e. they can be 
incorporated in any SyGuS solver) optimizations. These aim to compensate for 
the brittleness of the SyGuS solvers, which can fail for very simple constraints 
for no good reason. Here is a brief overview of these optimizations, which are 
applicable to any system of equations (baseline’s and ours): 


- Syntactic definitions, which are those that define an unknown function ξ
  unequivocally in the form of ξ(x_1, ..., x_n) = t, can be identified quickly and
  eliminated from the synthesis task to simplify it.

- A system of equations can be split into independent subsystems by identifying
  independent subsets of equations. A subset of equations is independent if
  it constrains a subset of the unknowns that does not appear in the rest of the
  set of equations. Identification of independent subsystems generates simpler
  subproblems.

- Instead of starting from a default initial state, we can start from a set of
  terms that makes for an interesting first round and consequently saves a few
  refinement rounds from the solution. We form a set of initial terms by using
  the Expand routine to expand enough terms such that each unknown appears
  in at least one constraint in the approximation for the first round.
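
The splitting into independent subsystems can be pictured as computing connected components over the unknowns. The Python sketch below is one possible way to do this with a union-find structure; it is not SYNDUCE's implementation, and the representation of an equation as a (label, set of unknowns) pair is an assumption of this sketch.

```python
def split_independent(equations):
    """Group equations that (transitively) share unknowns.

    equations is a list of (label, unknowns) pairs, where unknowns is the
    set of unknown names occurring in the equation. Returns a list of
    groups, each group being a list of labels.
    """
    parent = {}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    # Unknowns occurring in the same equation belong to the same subsystem.
    for _, unknowns in equations:
        us = sorted(unknowns)
        for u in us:
            parent.setdefault(u, u)
        for u in us[1:]:
            parent[find(u)] = find(us[0])

    groups = {}
    for label, unknowns in equations:
        us = sorted(unknowns)
        # An equation without unknowns forms a subsystem of its own.
        key = find(us[0]) if us else ("no-unknowns", label)
        groups.setdefault(key, []).append(label)
    return list(groups.values())
```

For the system of Example 4, all three equations mention s_0 and therefore end up in a single subsystem; equations over disjoint sets of unknowns would be returned as separate groups that can be sent to the solver independently.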


These optimizations are applied to both the baseline method and our algo- 
rithm for the purpose of evaluation. The extended version of this paper [7] 
includes more detailed evaluation of them and experimental results illustrating 
their precise impact on each algorithm. 


7 Evaluation 


We evaluate SYNDUCE on a broad set of benchmarks. Our benchmarks are
grouped into six categories. Table 1 lists all the benchmarks, grouped
accordingly. The benchmarks in each category share the same representation
function and polymorphic recursion skeleton, but a different reference
implementation is used to specify the synthesis problem. The recursion skeletons
(and the representation functions) are polymorphic and therefore reusable. Only
9 different skeletons and 4 different representation functions were used across
our 43 benchmarks. More details about the benchmarks, including the 9 skeletons
used, appear in the extended version of this paper [7].


7.1 Case Studies 


Changing Tree Traversals. An example of this category is the mips example 
used in the introduction. The reference function is a natural implementation 
of a function with a post- or in-order traversal of a binary tree. The target 
is an equivalent implementation corresponding to the divide-and-conquer tree 
homomorphism style recursion. 
From Trees to Paths. A tree path (zipper in [24]) is a data structure used
to represent a tree together with a subtree that is the focus of attention. Our 
running example belongs in this category. The other benchmarks in this category 
are from [24]. 


Enforcing Tail Recursion. In this category, the reference implementation 
is a direct-style recursion on the data structure, while the recursion skeleton 
specifies that an accumulator should be used to make the function tail-recursive. 
Tail recursive functions generally compile to more efficient code. 


Combining Traversals. Suppose a collection of existing implementations com- 
putes different functions with different traversals of the same data structure. If in 
some larger context all of these functions need to be computed, combining them 
can lower the amortized cost. In this set of benchmarks, we synthesize automati- 
cally the implementation that corresponds to traversing the data structure with 
a single recursion strategy, combining the computations into one. 


Tree Flattening. These benchmarks target the synthesis of an implementation 
on the more complex plane tree data structure from a reference implementation 
on the simpler binary tree data structure. 


Parallelizing Functions on Lists. Parallelizing a function on lists can be 
seen as the translation of a recursive function on cons-lists to a homomorphic 
function on lists built with the concatenation operator. These benchmarks are 
from [8,9,23]. 


7.2 Experimental Results 


To the best of our knowledge, there are no available tools that can be directly
compared against SYNDUCE. We can transform our specification to a format that
can be accepted by LEON [18]. However, the latter does not succeed in solving
even the simplest of our benchmarks (e.g. sum in the list function parallelization
category), likely due to the fact that the required deductive rules are missing.
We comment on the rest of the available tools in Sect. 8.

Table 1 presents the results of comparing SYNDUCE against the baseline
method. Both techniques use symbolic counterexamples, and therefore, the
comparison can highlight the performance impact of our partial bounding algorithm.
The most important point of comparison is the overall synthesis time. In 9 out
of 43 benchmarks, the baseline method times out. In another 5 cases, SYNDUCE
outperforms the baseline by two orders of magnitude. In the easiest of the
benchmarks, i.e. when the overall synthesis time of the baseline is in tens of
milliseconds, the two methods are equally good within a small margin of error.
The bold number in each row highlights the fastest synthesis time.

Amongst the 9 benchmarks for which the baseline algorithm times out, 7 are
cases where SYNDUCE takes advantage of partial bounding by leaving some
quantifiers unbounded. The baseline algorithm in these cases requires more terms
and terms of higher complexity in the finite approximations. Two of the 9
benchmarks (post-order mps and sum + mts + mps) are cases where the set of
maximally reducible terms is exactly the set of bounded terms (i.e. one cannot
take advantage of partial bounding), but SYNDUCE still outperforms the baseline
because it adds smaller terms to the abstraction through generalization and
produces less complex problems for the backend synthesis oracle. In summary,
both counterexample generalization and the partial bounding yield big practical
advantages in comparison with the baseline symbolic CEGIS algorithm.

Table 1. Experimental Results. Benchmarks are grouped by categories introduced in
Sect. 7.1. # steps indicates the number of refinement rounds. T_last is the elapsed
time before the last call to the SyGuS solver in the last refinement step before
timeout. All times are in seconds. The best time is highlighted in bold font. A '-'
indicates timeout (> 10 min). The "Inv" column indicates if codomain constraints
were required. Experiments are run on a laptop with 16 GB memory and an i7-8750H
6-core CPU at 2.20 GHz running Ubuntu 19.10.

Class                            | Benchmark            | Inv | SYNDUCE time | # steps | T_last | Baseline time | # steps | T_last
---------------------------------|----------------------|-----|--------------|---------|--------|---------------|---------|-------
Changing Tree Traversals         | sum                  | no  | 0.03         | 2       | 0.01   | 0.04          | 3       | 0.02
                                 | max                  | no  | 0.33         | 1       | 0.00   | 0.34          | 2       | 0.01
                                 | max 2                | no  | 0.25         | 1       | 0.00   | 0.34          | 2       | 0.01
                                 | min                  | no  | 0.23         | 1       | 0.00   | 0.32          | 2       | 0.01
                                 | min-max              | no  | 0.85         | 3       | 0.15   | 73.16         | 3       | 0.06
                                 | max weighted path    | no  | 0.09         | 3       | 0.03   | 0.07          | 3       | 0.02
                                 | sorted in-order      | no  | 0.01         | 1       | 0.00   | 43.97         | 4       | 1.98
                                 | pre-order poly.      | no  | 16.09        | 2       | 0.06   | -             | 4       | 0.97
                                 | mips                 | yes | 0.29         | 2       | 0.04   | -             | 4       | 2.70
                                 | in-order mts         | yes | 0.41         | 2       | 0.04   | -             | 4       | 4.84
                                 | post-order mps       | yes | 132.14       | 4       | 82.56  | -             | 6       | 39.29
From Tree to Path                | sum                  | no  | 0.07         | 2       | 0.02   | 0.06          | 3       | 0.02
                                 | height               | no  | 0.90         | 1       | 0.00   | 1.24          | 5       | 0.43
                                 | max weighted path    | no  | 0.15         | 2       | 0.03   | 0.12          | 3       | 0.03
                                 | max w. path (hom)    | no  | 0.01         | 1       | 0.00   | 1.42          | 4       | 0.69
                                 | leftmost odd         | no  | 0.01         | 1       | 0.00   | -             | 4       | 0.27
                                 | mips                 | yes | 1.57         | 3       | 0.50   | -             | 7       | 322.45
Enforcing Tail Recursion         | sum                  | no  | 0.02         | 2       | 0.01   | 0.03          | 3       | 0.02
                                 | mts                  | no  | 5.86         | 2       | 0.02   | 115.58        | 3       | 0.06
                                 | mps                  | no  | 1.68         | 2       | 0.02   | 0.34          | 3       | 0.03
Combining Traversals             | mts + sum            | no  | 9.71         | 2       | 0.02   | 5.42          | 3       | 0.03
                                 | sum + mts + mps      | yes | 0.26         | 3       | 0.12   | -             | 3       | 0.04
Tree Flattening                  | sum                  | no  | 0.07         | 3       | 0.04   | 0.07          | 2       | 0.01
                                 | product              | no  | 0.07         | 2       | 0.01   | 0.16          | 2       | 0.01
                                 | max of heads         | no  | 0.21         | 2       | 0.02   | 0.18          | 3       | 0.03
                                 | max of lasts         | no  | 0.21         | 2       | 0.02   | 0.33          | 3       | 0.03
                                 | max sibling sum      | no  | 5.26         | 2       | 0.03   | 2.72          | 3       | 0.04
Parallelizing Functions on Lists | sum                  | no  | 0.08         | 1       | 0.00   | 0.30          | 3       | 0.04
                                 | sum of even elts.    | no  | 0.10         | 1       | 0.00   | 0.39          | 3       | 0.04
                                 | length               | no  | 0.07         | 1       | 0.00   | 0.22          | 4       | 0.05
                                 | last                 | no  | 0.01         | 1       | 0.00   | 0.03          | 2       | 0.01
                                 | product              | no  | 0.07         | 1       | 0.00   | 0.31          | 3       | 0.04
                                 | polynomial           | no  | 0.07         | 1       | 0.00   | 0.71          | 5       | 0.10
                                 | hamming              | no  | 0.10         | 1       | 0.00   | 0.46          | 3       | 0.04
                                 | min                  | no  | 0.02         | 1       | 0.00   | 0.08          | 2       | 0.01
                                 | is sorted            | no  | 3.45         | 2       | 0.11   | 3.12          | 4       | 0.14
                                 | linear search        | no  | 0.08         | 1       | 0.00   | 0.35          | 3       | 0.04
                                 | line of sight        | no  | 0.86         | 2       | 0.09   | 7.67          | 4       | 0.34
                                 | mts                  | yes | 0.10         | 1       | 0.00   | 4.80          | 4       | 0.08
                                 | mps                  | yes | 0.09         | 1       | 0.00   | 4.73          | 4       | 0.08
                                 | mts and mps combined | yes | 0.38         | 2       | 0.11   | 210.84        | 6       | 36.77
                                 | mss                  | yes | 4.82         | 3       | 1.53   | -             | 6       | 24.23
                                 | count max elements   | no  | 138.20       | 1       | 0.00   | -             | 3       | 0.46

It is noteworthy that whenever an instance is hard, the majority of the time 
is spent in the Synthesize step. This becomes nearly 100% of the time for the 
baseline algorithm whenever it times out. The weakness of the baseline method 
lies in the fact that the recursion-free instances it generates are too difficult
for the backend solver. The timeout occurs within a few refinement
rounds (at most 7) when the baseline algorithm gets stuck in the Synthesize step 
attempting to solve a prohibitively difficult recursion-free synthesis instance. 

Across all benchmarks, our algorithm generally requires fewer refinement 
rounds than the baseline method. The few exceptions are the cases where the 
synthesis oracle gets lucky in producing a good solution when the target pro- 
grams are very simple, for example in the case of the sum and product bench- 
marks of the flat tree category. 

Finally, to isolate the precise contribution of the partial bounding idea, we 
evaluated the effect of each optimization on each algorithm. The applicability of a 
particular optimization highly depends on the particular set of constraints, which 
in turn depends on the specific benchmark and the algorithm (ours vs baseline) 
producing the constraints. Our synthesis algorithm yields more general and more 
succinct constraints, to which the optimizations are more often applicable. Of 
the 9 cases where Synduce succeeds and the baseline method times out, 7 are 
due to the inapplicability of these (simple) optimizations. SYNDUCE outperforms 
the baseline algorithm with all optimizations turned off for both. The detailed 
results are given in the extended version of this paper [7]. 


8 Related Work 


Synthesizing recursive programs is a challenging task, and several automated 
techniques have tackled the problem with different specifications of the problem 
and different approaches to the solution. 

Finitization, for example by bounding the depth of unbounded inputs or 
the number of recursive calls or loop iterations, is a straightforward way of 
dealing with unboundedness in synthesis [4,37] and verification [10]. In [32,33], 
high-level synthesis techniques use domain specific knowledge to finitize input 
programs. Quantifier instantiation, i.e. replacing quantified terms with ground 
terms, is commonly used in theorem proving and verification, and has also been 
useful in synthesis [31]. Our proposed algorithm can be viewed in the spirit of 
quantifier instantiation, with the major difference that (universally) quantified 
terms are replaced with other (universally) quantified terms which are still over 
an unbounded domain, yet with fewer degrees of freedom in unboundedness. 
Synthesis through Program Transformation. Our precise problem statement
is inspired by the transformation system developed by Burstall and Darlington
[6]. They set out to automate the task of transforming an initial program
specified as a set of first-order recursion equations into a more efficient program,
by altering the recursive structure. Their approach is based on transformation
rules and is semi-automatic. They use specific rules, e.g. associativity of a data
operation, to perform the transformations, and such rules do not generalize well.
We defer the reasoning about the operations on the data to an SMT solver, and
therefore need not rely on such rules. Techniques based on program transformation
have been applied to the synthesis of special classes of recursive programs
before [13,15]. For example, the work in [1] focuses on tail recursion, and a lot of
attention has been given to producing divide-and-conquer recursions in the way
of automated parallelization [2,8,23].


Synthesizing Recursive Functional Programs. Inductive techniques were 
developed to construct recursive programs from input/output examples [35], and
this approach has been extended in more recent work [16,17]. The latter two are
examples of an analytical approach to program synthesis in which programs are
constructed from the analysis of examples. Other recent approaches are
search-based methods. ESCHER [3] synthesizes recursive functions from user-provided
components by interactively asking for more examples from the user. λ² [11]
synthesizes data structure transformations from input/output examples using 
higher-order functions. 

Tools like λ² and ESCHER can be complementary to SYNDUCE in a more
general context of recursion synthesis. The user can try to synthesize an
implementation of a recursive function over a simple data type from input/output
examples with λ² or ESCHER, where the chance of success is higher. This then serves
as the reference implementation input to SYNDUCE, which can aim for a more
sophisticated implementation over a more complex recursive datatype.

MYTH [27], MYTH2 [12] and SYNQUID [28] use type information to direct
the search for a program satisfying a specification. In MYTH, this specification
is a set of input/output examples. MYTH2 generalizes this approach by treating
examples as limited types. The specification for SYNQUID is a polymorphic
refinement type, and the tool synthesizes an implementation of the given type
using components provided by the user. Type-based approaches work well within
the expressivity of refinement types as specifications, but refinement types
cannot express constraints for all desired synthesis tasks. Our specification is
strictly stronger than both input/output examples and refinement types.

In SYNTREC [14], reusable templates are used to facilitate the synthesis of 
algebraic data type (ADT) transformations. The reusable templates are meant to 
lessen the burden of the user in specifying the search space of the programs to be 
synthesized every time. The recursion skeletons in our framework are effectively 
(reusable) polymorphic recursion templates. The user can be provided with a 
library of common recursive datatypes with representation functions mapping 
between these types, and useful recursion skeletons on these datatypes.
SYNTREC [14] synthesizes ADT transformations from a functional specification. In
contrast, our tool takes this transformation as input (the representation
function) and synthesizes a function from ADT to a base type.

LEON [18], a deductive verification and synthesis framework, can synthesize 
recursive functions from first-order specifications with recursive predicates. In 
Sect. 7, we commented on a comparison of LEON against SYNDUCE. 


Higher-Order Recursion Schemes. We use recursion schemes as a model for 
our programs, but our contribution has very little to do with the original work 
introducing this model. Higher-order recursion schemes have been introduced 
for model checking functional programs [19-21,30]. Pattern matching recursion
schemes, introduced in [26], provide a model for functional programs that manip- 
ulate ADTs. We use them as an accurate description of a class of functions on 
ADTs and the notion of reduction associated with them as a crisp way of for- 
mulating symbolic evaluation. 


9 Discussion and Future Work 


We have demonstrated that partial bounding of quantifiers can be a power- 
ful tool for the synthesis of recursive programs. Circumventing the unnecessary 
bounding of some quantifiers leads to simpler instances of recursion-free synthesis 
subtasks that can be handled by the current tools. Moreover, our counterexam- 
ple generalization also yields simpler terms for bounding the quantifiers that 
have to be bounded. This is the result of our focus being on a class of recur- 
sive functions that perform structural recursion (i.e. recursion that deconstructs 
its inputs). This, together with our specific problem setup, takes the guesswork 
out of counterexample generalization and provides the means for a constructive 
counterexample generalization scheme which is demonstrably effective. 

The reliance on structural recursion, therefore, limits the class of reference 
implementations and recursion skeletons that can define an acceptable synthesis 
instance in our framework. Another limitation tied to the input model is that 
the output of the recursive functions has to belong to the base (non-recursive) 
types to accommodate the reduction of the problem to one that can be solved 
by a backend solver. Consequently, the unknowns in a target recursion scheme 
have to all be functions from base types to base types. 

In our problem setup, the recursion strategy (given by the recursion skeleton) 
is an integral part of the specification since it is used to communicate program- 
mer intent. Expecting a complete recursion skeleton may be viewed as another 
limitation of our technique. For example, the mts (maximal tail sum) function 
can be computed as a function on a list maintaining only one integer value (i.e.
the current value of the maximum tail sum), yet, to implement mts in a
divide-and-conquer strategy, another computation, the sum of the elements of the
list, has to be performed alongside this one. Ideally, the user would be able to
ask for a divide-and-conquer recursion strategy without having to know that the
additional computation of sum is required as well.
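
The mts example can be made concrete with a small sketch. The Python below is
an illustration only, not output of SYNDUCE: the function names are
hypothetical, and we assume that mts counts the empty tail (so the result is
never negative). It contrasts the single-accumulator fold with a
divide-and-conquer version whose join needs the auxiliary sum.

```python
def mts_fold(xs):
    """Maximal tail sum as a left-to-right fold with one integer of state."""
    m = 0  # assumption: the empty tail is allowed and has sum 0
    for x in xs:
        # Every non-empty tail of xs + [x] is a tail of xs extended by x.
        m = max(m + x, 0)
    return m


def mts_dc(xs):
    """Divide-and-conquer variant; returns the pair (sum, mts)."""
    if not xs:
        return (0, 0)
    if len(xs) == 1:
        return (xs[0], max(xs[0], 0))
    mid = len(xs) // 2
    sum_l, mts_l = mts_dc(xs[:mid])
    sum_r, mts_r = mts_dc(xs[mid:])
    # A tail of l ++ r either lies within r, or is a tail of l followed by
    # all of r. The second case is why sum_r must be computed as well.
    return (sum_l + sum_r, max(mts_r, mts_l + sum_r))


def mts_spec(xs):
    """Brute-force reference: maximum over all tails, including the empty one."""
    return max([0] + [sum(xs[i:]) for i in range(len(xs))])
```

Without `sum_r`, the join in `mts_dc` cannot be expressed from `mts_l` and
`mts_r` alone, which is the extra computation the user would otherwise have to
anticipate when writing the recursion skeleton.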

Ideally, the user should be permitted to provide an incomplete recursion skele- 
ton which sufficiently communicates the intent and leave the recursion skeleton 
to be completed automatically by the synthesis procedure. This is a tricky problem.
There are not only many recursion strategies to choose from, but each choice
also leads to unboundedly many ways to organize the computation on data. This 
adds yet another dimension of unboundedness to the synthesis problem beyond 
the two already tackled in this paper. Note that in other recursion synthesis
work such as [3,12,14,28], new operations on data are not synthesized but are
instead drawn from an existing pool of operations. Therefore, this particular
problem does not apply in those contexts.

Finally, our method currently does not take into account invariants over 
recursive data types, e.g. an invariant that specifies that a tree is a binary search 
tree. Some properties of the datatypes can be encoded through the representation 
function, e.g. the associativity of the concatenation operator in the category 
of list parallelization benchmarks. Incorporating the more general invariants in 
future work will broaden the expressivity of the framework in handling more 
interesting problems. 
Abstract. This paper presents PAYNT, a tool to automatically synthesise
probabilistic programs. PAYNT enables the synthesis of finite-state
probabilistic programs from a program sketch representing a finite family
of program candidates. A tight interaction between inductive oracle-guided
methods and state-of-the-art probabilistic model checking is at
the heart of PAYNT. These oracle-guided methods effectively reason
about all possible candidates and synthesise programs that meet a given 
specification formulated as a conjunction of temporal logic constraints 
and possibly including an optimising objective. We demonstrate the per- 
formance and usefulness of PAYNT using several case studies from dif- 
ferent application domains; e.g., we find the optimal randomized protocol 
for network stabilisation among 3M potential programs within minutes, 
whereas alternative approaches would need days to do so. 


1 Introduction 


Probabilistic programs are a powerful modelling language to describe systems 
containing probabilistic uncertainty. Their correctness and efficiency can be 
described as a set of declarative temporal constraints. Various verification tools 
cater for automating their a posteriori verification: does a program satisfy a
specification? Here, we focus on finite-state programs and consider specifications
given as (conjunctions of) temporal logic constraints. The automated verification
of such constraints is supported by probabilistic model checkers such as
STORM [19], PRISM [35] or MODEST [27]. 

These model checkers typically require a fixed program or a fixed model. This
is not always in line with their intended usage: To keep development costs
manageable and development cycles fast, system designs are preferably verified as
early as possible. However, at early design stages not all system details are known
or they are deliberately left out, and systems or their models are incomplete:
they contain holes. A hole may, e.g., reflect a partially implemented controller
for a complex system or an unspecified component for wireless communication.

Fig. 1. The workflow of the synthesis process.

A key aspect of the design cycle is to explore these designs, i.e., to do design 
space exploration. The verification challenge now is to analyze all combinations of 
fixing the hole with a concrete behavior/subsystem and reveal (Pareto-)optimal 
designs. Alternatively, designs should be robust for engineering choices made 
downstream, e.g., a system should ideally not depend on the specific character- 
istics of a single communication interface. Verifying that every combination of 
options satisfies the specification ensures that changes in available components 
do not need to trigger a redesign. 

The application areas above require reasoning about the presence and absence
of designs (aka: realizations) satisfying a specification in a family of designs. To
allow for efficient reasoning it is crucial that this family is concisely represented. 
A convenient way to describe such a family is to use sketching [2,45]. A sketch 
can be thought of as a program (or model) with holes, naturally fitting the use 
case outlined above. 

Clearly, enumerating single realizations is infeasible in the light of the
combinatorial design space explosion. Instead, the prevalent approach connected with
sketching is based on inductive synthesis. The idea is to analyze a single realiza- 
tion and generalize the analysis results to a set of realizations, often using the 
notion of counterexamples. In probabilistic programs, such a notion is challeng- 
ing, as counterexamples are typically complex objects [1]. 

Driven by a range of applications, there has been significant algorithmic 
progress in the analysis of probabilistic program sketches and temporal logic 
constraints over the last years. Baier et al. [14] explored the use of sym- 
bolic model-checking methods so as to consider sets of realizations at once. 
Češka et al. [12] used abstraction-refinement on sets of realizations and com- 
plemented this with a counterexample-guided inductive synthesis approach [11]. 
The latter two approaches have recently been integrated [3] and yield a speed up 
of multiple orders of magnitude over a baseline that enumerates all realizations. 

This paper presents PAYNT¹ (Probabilistic progrAm sYNThesizer) that
takes a program sketch, concisely describing a finite family of finite Markov
chains (MCs), and a specification, and finds a family member (aka: realization)
that (potentially optimally) satisfies the specification, see Fig. 1. The design of
PAYNT is rooted in oracle-guided synthesis and enables the flexible combination
of a variety of state-of-the-art algorithms. For efficiency purposes, key algorithms
are implemented within the STORM [19] model checker that dominated recent
tool comparisons [24]. To deliver flexibility, the tool is built in a modular fashion
on top of a Python API. To ease the learning curve, the tool takes a conservative
extension to the widespread PRISM language as input.

¹ Available at https://github.com/gargantophob/synthesis.

PAYNT aims at two user groups: First, it provides a development plat- 
form for alternative algorithmic approaches, e.g. exploiting recent neurosym- 
bolic approaches to find good designs. The tool provides the interface to define 
sketches and all baseline algorithms under one roof. Secondly, the analysis of 
sets of realizations is a valuable backend for automatic engines, e.g., when syn- 
thesizing finite-state controllers for partially observable MDPs (POMDPs) [33]. 


Related work. The synthesis problems for parametric probabilistic systems can 
be divided into two categories. 

Topology synthesis, akin to the aim of PAYNT, assumes a finite set of parameters
affecting the MC topology. Finding a realization satisfying a given reachability
property is NP-complete in the number of parameters [13], and can be naively
solved by analysing all individual family members. An alternative [14] is to
model the MC family by a Markov decision process (MDP) and use off-the-shelf
MDP model-checking algorithms. The ProFeat [14] and QFLan [47] tools take
this approach to quantitatively analyze alternative designs of software product
lines [23,36]. These tools are limited to small families. To improve the scalability,
inductive methods based on abstraction-refinement over the MDP representa- 
tion [12], and counter-example guided inductive synthesis (CEGIS) for MCs [11] 
have been proposed. As shown by the Maze model in Sect. 5, the topology syn- 
thesis is closely linked to controller synthesis for POMDPs, a popular model for 
planning in AI under uncertainty. Other recent approaches to POMDP controller 
synthesis include the use of neural network oracles (obtained by reinforcement 
learning) to guide the search [48] and adaptive learning schemes based on imi- 
tation learning [30]. Note that the problem of sketching probabilistic programs 
that fit given data as studied, e.g., in [39,44], is different. 

Parameter synthesis considers models with a fixed topology but with uncertain
parameters associated with transition probabilities (or rates). It aims to
analyze how the MC (or MDP) behaviour depends on the parameter values. Scalable
approximate parameter synthesis techniques treat identical parameters in different
transitions independently [10,42] and have been implemented in STORM [19]
and PRISM [35]. Exact approaches construct rational functions for symbolic
reachability probabilities [16] and were improved in [18,25,29]. This approach
has also been applied to problems such as model repair [4,40].

Both synthesis problems can be attacked by search-based techniques that
do not ensure an exhaustive exploration of the parameter space. These include
evolutionary techniques [26,38] and genetic algorithms [22]. Their combination
with parameter synthesis has been pursued in [8] and is implemented in the tool
RODES [9] to synthesize robust systems.

Fig. 2. The server for request processing.


2 Using PAYNT 


We exemplify the usage of PAYNT by the following synthesis problem. 

Consider a server for request processing depicted in Fig. 2. Requests are gen- 
erated (externally) in random intervals and upon arrival stored in a request 
queue of capacity Qmax. When the queue is full, the request is lost. The server 
has three profiles — sleeping, idle and active — that differ in their power con- 
sumption. The requests are processed by the server only when it is in the active 
state. Switching from a low-energy state into the active state requires additional 
energy as well as an additional random latency before the request can be pro- 
cessed. We further assume that the power consumption of request processing 
depends on the current queue size. The operation time of the server is random 
but finite. 

The server is controlled by a power manager (PM) that observes the cur- 
rent queue size and then sets the desired power profile. More precisely, the PM 
distinguishes between four queue occupancy levels determined by the threshold 
levels T1, T2, and T3. These values are controllable parameters that denote which 
fraction of the queue capacity is occupied. In other words, the PM observes the
queue occupancy of the intervals: $[0, T_1]$, $(T_1, T_2]$, etc. For each occupancy
level, the PM changes to the associated power profile $P_1, \ldots, P_4 \in \{0, 1, 2\}$,
where numbers 0 through 2 encode the profiles sleeping, idle and active, respectively.

PAYNT takes as an input a sketch — a program description in the PRISM 
(or JANI) language containing some undefined parameters (holes) with associ- 
ated options from domains. A PRISM program consists of one or more reactive 
modules that may interact with each other using synchronization. A module 
has a set of (bounded) variables that span its state space. Possible transitions 
between states of a module are described by a set of guarded commands of the 
form: 

    [action] guard -> p_1 : update_1 + ... + p_n : update_n;


If the guard evaluates to true, an update of the variables is chosen according to
the probability distribution given by expressions p_1 through p_n. The actions are
used to force two or more modules to make the command simultaneously (i.e. to 
synchronize). The holes can appear in guards and updates. Replacing each hole 
with one of its options yields a complete program with the semantics given by a 
finite-state Markov chain. The following sketch describes the PM (the modules
implementing the other components of the server are omitted for brevity). 


module PM
    pm : [0..2] init 0; // 0 - sleep, 1 - idle, 2 - active
    [sync0] q <= T1*QMAX -> (pm'=P1);
    [sync0] q > T1*QMAX & q <= T2*QMAX -> (pm'=P2);
    [sync0] q > T2*QMAX & q <= T3*QMAX -> (pm'=P3);
    [sync0] q > T3*QMAX -> (pm'=P4);
endmodule


In our example, we consider the following holes and domains describing:
the thresholds $T_1 \in \{0, 0.1, 0.2, 0.3, 0.4\}$, $T_2 \in \{0.5\}$,
$T_3 \in \{0.6, 0.7, 0.8, 0.9\}$², the corresponding power profiles
$P_1, \ldots, P_4 \in \{0, 1, 2\}$, and the queue capacity
$Q_{max} \in \{1, \ldots, 10\}$. The resulting sketch describes a design space of
$10 \cdot 5 \cdot 4 \cdot 3^4 = 16{,}200$ different power managers where the average
size of the underlying MC (of the complete system) is around 900 states.
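
As a sanity check on the size of this design space, the hole domains can be
written down and enumerated directly. This is a minimal Python sketch for
illustration only; the dictionary layout and the helper `realizations` are our
own and are not part of PAYNT's interface.

```python
from itertools import product
from math import prod

# Hole domains of the power-manager sketch, as listed in the text.
domains = {
    "QMAX": list(range(1, 11)),
    "T1": [0, 0.1, 0.2, 0.3, 0.4],
    "T2": [0.5],
    "T3": [0.6, 0.7, 0.8, 0.9],
    "P1": [0, 1, 2],
    "P2": [0, 1, 2],
    "P3": [0, 1, 2],
    "P4": [0, 1, 2],
}

# The family size is the product of the domain sizes.
family_size = prod(len(options) for options in domains.values())


def realizations(domains):
    """Yield every hole assignment as a dict (hypothetical helper)."""
    names = list(domains)
    for values in product(*(domains[name] for name in names)):
        yield dict(zip(names, values))
```

Here `family_size` evaluates to 10 · 5 · 1 · 4 · 3⁴ = 16,200, matching the
count above; adding one more three-valued hole would triple it, which is the
exponential growth that rules out naive enumeration for larger sketches.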

The goal is to find the concrete power manager, i.e., the instantiation of
the holes, that minimizes power consumption while the expected number of lost
requests during the operation time of the server is below 1. Such a specification
$\Phi$ is formalized as a list of temporal logic formulae in the PRISM syntax:


    R{"lost"}<= 1 [ F "finished" ]
    R{"power"}min=? [ F "finished" ]


Using the sketch and the specification $\Phi$, PAYNT effectively explores the design
space and finds a hole assignment inducing a program that satisfies $\Phi$, provided
such an assignment exists. Otherwise, it reports that such a design does not exist.
For the example, PAYNT produces the following output containing the hole
assignment and the quality wrt. $\Phi$ of the corresponding program:


    hole assignment: QMAX=5,T1=0,T2=0.5,T3=0.7,P1=1,P2=2,P3=2,P4=2
    R[exp]{"lost"}=0.6822759696 [F "finished"]
    R[exp]{"power"}min=9100.064246 [F "finished"]


The obtained optimal power manager has queue capacity 5 with thresholds (after
rounding) at 0, $2 = \lfloor 5 \cdot 0.5 \rfloor$ and $3 = \lfloor 5 \cdot 0.7 \rfloor$.
In addition, the power manager always maintains an active profile unless the
request queue is empty, in which case the device is put into an idle state. This
solution leads to the expected number of lost requests of $\approx 0.68 < 1$ and
the power consumption of 9,100 units.
PAYNT computes this optimal solution in one minute. This is three times faster 
than a naive enumeration of all solutions. 
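
To see how the reported hole assignment translates into a concrete policy, the
guards of the PM module can be evaluated for every queue size. The following
Python sketch is illustrative only: the function `profile_for` is hypothetical,
we assume the queue size q ranges over 0..QMAX, and thresholds are kept as exact
fractions to avoid floating-point rounding.

```python
from fractions import Fraction
from math import floor

# Hole assignment reported in the output above (thresholds kept as strings
# so that they convert to exact fractions).
assignment = {"QMAX": 5, "T1": "0", "T2": "0.5", "T3": "0.7",
              "P1": 1, "P2": 2, "P3": 2, "P4": 2}
PROFILE = {0: "sleeping", 1: "idle", 2: "active"}


def profile_for(q, a):
    """Evaluate the four guards of module PM for queue size q."""
    qmax = a["QMAX"]
    t1, t2, t3 = (Fraction(a[k]) * qmax for k in ("T1", "T2", "T3"))
    if q <= t1:
        return a["P1"]
    if q <= t2:
        return a["P2"]
    if q <= t3:
        return a["P3"]
    return a["P4"]


# Rounded thresholds, as stated in the text.
rounded = [floor(Fraction(assignment[k]) * assignment["QMAX"])
           for k in ("T1", "T2", "T3")]

# Profile chosen for each possible queue size.
policy = {q: PROFILE[profile_for(q, assignment)]
          for q in range(assignment["QMAX"] + 1)}
```

Under these assumptions `rounded` is `[0, 2, 3]`, and `policy` maps queue size 0
to idle and every non-empty queue to active, in line with the description above.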

Let us consider a more complex variant of the synthesis problem inspired by
the well-studied model of a dynamical power manager for complex electronic
systems [5,21]. The corresponding sketch describes around 43M available solutions
with an average MC size of 3.6k states. While enumeration needs more than
1 month to find the optimal power manager, PAYNT solves it within 10 h.


² Note that this simply ensures that $T_1 < T_2 < T_3$. PAYNT further supports
restrictions, i.e., additional constraints on parameter combinations.


3 Synthesis of Probabilistic Programs


We formalize the synthesis problems supported by PAYNT and briefly present
state-of-the-art synthesis algorithms; more details can be found in [3,11,12].


Problem Statement 


Sketch. PAYNT uses sketches to define the set of designs. Let $P$ be a sketch
containing holes from the set $H = \{H_k\}_k$, with $R_k$ being the set of options
available for hole $H_k$. Let $R = \prod_k R_k$ denote the set of all hole assignments
(realizations), $P[r]$ denote the program induced by a substitution $r \in R$ and
$D_r$ denote the underlying MC. Note that the size of the set $R$ is exponential
in $|H|$.


Specification. PAYNT supports conjunctions of specifications with reachability
and expected rewards. For a set $T$ of target states, reachability properties
$\varphi = \mathbb{P}_{\bowtie \lambda}[F\, T]$ with $\lambda \in [0,1]$ and
$\bowtie \in \{<, \le, \ge, >\}$ express that the probability to reach $T$ relates
to $\lambda \in [0,1]$ according to $\bowtie$. Expected reward properties
$\varphi = \mathbb{E}_{\bowtie \lambda}[F\, T]$ express that the expected reward
accumulated before $T$ is reached relates to $\lambda \in \mathbb{R}^{+}$ according
to $\bowtie \in \{<, \le\}$. Let $P[r] \models \varphi$ denote that the program
$P[r]$ induced by the realisation $r$ satisfies $\varphi$. For a specification
$\Phi = \{\varphi_i\}_{i \in I}$ given by a finite set of properties, we write
$P[r] \models \Phi$ to denote $\forall i \in I : P[r] \models \varphi_i$.


Synthesis problems. PAYNT is able to answer two types of synthesis questions
for a PRISM sketch $P$ with a set $R$ of realizations and a specification $\Phi$:


Feasibility: Find a realization $r \in R$ such that $P[r] \models \Phi$.


Maximality: For property $\varphi_{\max}$, find a realization $r^* \in R$ such that


$$r^* \in \arg\max_{r \in R} \{\, \mathbb{P}[P[r] \models \varphi_{\max}] \mid P[r] \models \Phi \,\}.$$


Variants of the maximal synthesis problem for expected rewards and minimization
are defined analogously. PAYNT also supports a relaxed variant of maximal
synthesis, $\varepsilon$-maximal synthesis: find a realization $r^*$ such that
$P[r^*] \models \Phi$ and
$\mathbb{P}[P[r^*] \models \varphi_{\max}] \ge (1-\varepsilon) \cdot \max_{r \in R} \{\, \mathbb{P}[P[r] \models \varphi_{\max}] \mid P[r] \models \Phi \,\}$
for a given $\varepsilon \in (0,1)$.
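
The three synthesis questions can be illustrated by brute-force enumeration
over a toy family. The Python sketch below is not PAYNT code: the callbacks
`satisfies` (standing in for a model-checking query deciding whether a
realization meets the specification) and `value` (standing in for the
probability of the property to maximize) are hypothetical placeholders, as are
all function names.

```python
from itertools import product


def realizations(options):
    """Yield every hole assignment of the family as a dict."""
    names = sorted(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def feasible(options, satisfies):
    """Feasibility: return some realization that satisfies the spec, or None."""
    return next((r for r in realizations(options) if satisfies(r)), None)


def maximal(options, satisfies, value):
    """Maximality: among satisfying realizations, return one of maximal value."""
    best = None
    for r in realizations(options):
        if satisfies(r) and (best is None or value(r) > value(best)):
            best = r
    return best


def is_eps_maximal(r, options, satisfies, value, eps):
    """Check the relaxed condition value(r) >= (1 - eps) * optimum."""
    optimum = maximal(options, satisfies, value)
    if optimum is None or not satisfies(r):
        return False
    return value(r) >= (1 - eps) * value(optimum)


# Toy family with two holes; both callbacks are made up for illustration.
toy_options = {"a": [0, 1, 2], "b": [0, 1]}
toy_satisfies = lambda r: r["a"] + r["b"] <= 2
toy_value = lambda r: (r["a"] + 2 * r["b"]) / 10
```

On the toy family, `maximal` returns `{"a": 1, "b": 1}` with value 0.3, while
`{"a": 2, "b": 0}` (value 0.2) is accepted as ε-maximal for ε = 0.4 but not for
ε = 0.1. Each call to `satisfies` or `value` would correspond to a model-checking
query, which is why this enumeration does not scale.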


Existing Synthesis Methods 


Synthesis methods can be classified into two orthogonal groups: i) complete
methods, which can prove the non-existence or optimality of a solution to the
given problem, and ii) incomplete methods leveraging various smart search
strategies and evolutionary algorithms [22,26,38]. While its architecture is
flexible, the current release of PAYNT is built around state-of-the-art complete
methods. As a baseline and reference algorithm, the tool implements the so-called
one-by-one approach [15], which simply enumerates each realization $r \in R$. The
design-space explosion renders this approach unusable for large problems,
necessitating the usage of advanced techniques that exploit any structure of the
family of MCs.

Fig. 3. Oracle-guided synthesis (adapted from [3]).


Oracle-guided synthesis. At the heart of PAYNT is an oracle-guided inductive
synthesis approach [31,32,46]. A learner selects a realization $r$ and passes
it to an oracle. The oracle answers whether $r$ satisfies $\Phi$ and, crucially,
gives additional information, usually a counter-example (CE), whenever this is not
the case. PAYNT implements two orthogonal oracles: (a) an inductive oracle CE,
which examines single realizations to infer statements about other realizations
[11], and (b) a deductive oracle AR (Abstraction Refinement), which argues about
sets of realizations by considering (an aggregation of) these realizations at
once [12].
PAYNT supports the combined use of these two oracles as a hybrid synthesis 
method [3]. 

Figure 3 [3] illustrates the communication between the learner and the two
oracles. The Abstr-Oracle analyzes a sub-family $R$ with 3 possible outcomes: 1)
it proves that all its realizations satisfy $\Phi$, i.e., that the synthesis problem
is feasible, or 2) it proves that all its realizations violate $\Phi$, i.e., the
learner can safely discard $R$, or 3) the analysis is inconclusive and it returns
safe bounds on the best- and worst-case behavior of all realizations in $R$ wrt.
$\Phi$. The CE-Oracle analyzes a realization $r$ and either proves that $r$ satisfies
$\Phi$ or it generalizes $r$ into a subfamily $R'$. The learner can discard $R'$
since it is guaranteed that all its realizations violate $\Phi$. In the hybrid
approach, the CE-Oracle exploits the bounds in order to compute smaller CEs
allowing a better generalization. The learner maintains a queue of subfamilies
$R' \subseteq R$ that have to be further processed and also controls which oracle
is used based on their previous performance.
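
The interaction between the learner and the CE-Oracle can be sketched as
follows. This Python fragment is a simplified illustration under our own
assumptions, not PAYNT's implementation: the oracle is a hypothetical callback
that returns `None` if a realization satisfies the specification and otherwise
a set of hole names (a conflict) such that every realization agreeing with it
on those holes also violates the specification. Candidates are produced by
plain enumeration with filtering rather than by an SMT solver.

```python
from itertools import product


def cegis(options, ce_oracle):
    """Return (satisfying realization or None, number of oracle calls)."""
    names = sorted(options)
    conflicts = []  # partial assignments known to imply a violation
    oracle_calls = 0
    for values in product(*(options[name] for name in names)):
        r = dict(zip(names, values))
        # Skip realizations covered by a previously learned conflict.
        if any(all(r[h] == v for h, v in c.items()) for c in conflicts):
            continue
        oracle_calls += 1
        conflict_holes = ce_oracle(r)
        if conflict_holes is None:
            return r, oracle_calls
        # Generalize: every realization with these hole values is pruned.
        conflicts.append({h: r[h] for h in conflict_holes})
    return None, oracle_calls


def toy_ce_oracle(r):
    """Made-up oracle: a == 0 or b == 0 alone suffices for a violation."""
    if r["a"] == 0:
        return {"a"}
    if r["b"] == 0:
        return {"b"}
    return None


toy_family = {"a": [0, 1, 2], "b": [0, 1], "c": [0, 1]}
result, calls = cegis(toy_family, toy_ce_oracle)
```

On this toy family the loop needs three oracle calls to reach
`{"a": 1, "b": 1, "c": 0}`, whereas enumeration without generalization would
check seven realizations first. The smaller the conflict, the larger the pruned
subfamily, which is why smaller counterexamples matter.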


4 Tool Architecture of PAYNT 


PAYNT is implemented on top of the probabilistic model checker Storm [19].
While the high-performance parts were implemented in C++, we use a Python
API to flexibly construct the overall synthesis loop. For SMT-solving, we use
Z3 [37]. PAYNT takes a PRISM [35] or JANI [7] sketch and a set of temporal
properties, and returns a satisfying realization, if one exists. Otherwise, it
reports that no such realization exists.

Fig. 4. The tool architecture (Color figure online)

Figure 4 depicts a high-level view of the tool architecture, which primarily
consists of components for family handling (purple), chain building (green) 
and model checkers (red). 

The family handlers are used to store information about the previously covered
design space: Member enumeration simply iterates over all realizations. The
SAT representation stores a SAT-formula describing unexplored realizations and
uses the SMT solver Z3 for linear (bounded) integer arithmetic to retrieve the
next candidate realization. The subfamily queue stores a collection of unexplored
subfamilies and refines these subfamilies as hyper-rectangles. The chain builders
take as input a single assignment $r \in R$ or a set $R' \subseteq R$ of
realizations, and produce a representation of the MC or a quotient MDP,
respectively, in the internal memory model of the model checkers. The model
checkers are then used to verify these chains. They either output yes/no or, in
the case of MDPs, provide lower and upper bounds on satisfiability probabilities.
PAYNT includes a module for counterexample generation by using either a
MaxSat [17,49] or a greedy state-expansion [3] approach.

Figure 4 also illustrates three analysis loops that mirror the behaviour of
1-by-1 enumeration (the baseline), CEGIS and AR. The 1-by-1 approach simply
iterates over all possible realizations until a satisfying one is obtained. The
CEGIS loop additionally constructs counterexamples to each unsatisfying
realization $r \in R$, yielding a whole subset $R' \subseteq R$ of realizations
that are pruned
from the family. In contrast to this enumeration, the AR loop constructs and 
model checks MDPs from the subfamily queue and subsequently refines these 
subfamilies if the obtained bounds on satisfiability yield inconclusive results. 
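
The difference between the 1-by-1 baseline and the CEGIS loop can be illustrated with a minimal Python sketch on a toy family. Everything here is an assumption made for illustration: `HOLES`, `satisfies` and `counterexample_holes` are hypothetical stand-ins for the sketch, the model checker and the counterexample generator, and plain filtering replaces the SAT representation that PAYNT queries through Z3.

```python
from itertools import product

# Hypothetical toy sketch: three holes with small option domains.
HOLES = {"x": [0, 1, 2], "y": [0, 1, 2], "z": [0, 1]}


def satisfies(r):
    """Stand-in for model checking the chain induced by realization r."""
    return r["x"] + r["y"] >= 4  # only x = y = 2 works; z is irrelevant


def counterexample_holes(r):
    """Stand-in for CE generation: the holes that explain the violation."""
    return ("x", "y")  # in this toy property z never matters


def one_by_one():
    """Baseline: check realizations one at a time."""
    names, checks = list(HOLES), 0
    for values in product(*HOLES.values()):
        r = dict(zip(names, values))
        checks += 1
        if satisfies(r):
            return r, checks
    return None, checks


def cegis():
    """Prune every realization that agrees with a known conflict."""
    names, checks = list(HOLES), 0
    conflicts = set()  # partial assignments known to violate the property
    for values in product(*HOLES.values()):
        r = dict(zip(names, values))
        if any(all(r[h] == v for h, v in c) for c in conflicts):
            continue  # r lies in an already excluded subset R'
        checks += 1
        if satisfies(r):
            return r, checks
        conflicts.add(tuple((h, r[h]) for h in counterexample_holes(r)))
    return None, checks
```

In this toy family both loops return the same realization, but the CEGIS variant skips every candidate that differs from an earlier failure only in the irrelevant hole `z`.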
Table 1. Case study statistics and PAYNT synthesis times versus the naive 1-by-1 
enumeration. Two problems per model are considered: an optimal synthesis problem (hard) 
and a feasibility problem (easy). In both cases, all realizations need to be explored to 
prove optimality and unsatisfiability, resp. Values indicated with * are estimates. 


Model  | Number of parameters | Family size | Average MC size | 1-by-1 enumeration | Tool performance (hard) | Tool performance (easy) 
DPM    | 16                   | 43M         | 3.6k            | 35 days*           | 9.3 h                   | 11 h 
Maze   | 22                   | 9.4M        | 0.2k            | 1.8 days*          | 1 h                     | 54 min 
Herman | 7                    | 3.1M        | 1.1k            | 1.5 days*          | 17 min                  | 1.1 min 
Pole   | 17                   | 1.3M        | 5.6k            | 1 day*             | 8.5 min                 | 5 s 
Grid   | 8                    | 65k         | 1.2k            | 32 min             | 37 s                    | 21 s 


The hybrid approach combines both AR and CEGIS approaches and switches 
between the two loops mid-execution. In particular, the integrated method executes 
the abstraction-refinement loop and, whenever it encounters an undecidable 
family that needs to be split, CEGIS attempts to analyze it for 
a limited time period. If some family members are excluded based on a counterexample, 
the CEGIS engine updates the corresponding SAT representation 
to ensure it does not analyze the same member twice. Two additional 
links couple the AR and CEGIS loops and enable efficient integrated analysis. 
The first is the use of bounds from MDP model checking during the greedy 
CE generation, which allows the construction of larger family-aware conflicts. 
The second concerns chain construction: since these bounds are associated with 
the states of the quotient MDP M^{R'} for the (sub-)family R' and counterexamples 
are constructed as sub-MCs of the MC D_r, r ∈ R, in the integrated setting we 
construct D_r directly from M^{R'} to save time on converting bound values 
between the two chains. 
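
The abstraction-refinement loop over hyper-rectangular subfamilies can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `bounds` stands in for model checking a quotient MDP and returns exact lower and upper bounds of a made-up linear objective, and `WEIGHTS`, `split` and `ar_count_satisfying` are hypothetical names that do not come from PAYNT.

```python
from collections import deque

# Hypothetical objective: value(r) = 3 * r["a"] + 2 * r["b"].
WEIGHTS = {"a": 3, "b": 2}  # non-negative, so corner values give the bounds


def bounds(sub):
    """Stand-in for MDP model checking: bounds over a whole subfamily."""
    lo = sum(WEIGHTS[h] * l for h, (l, _) in sub.items())
    hi = sum(WEIGHTS[h] * u for h, (_, u) in sub.items())
    return lo, hi


def split(sub):
    """Halve the widest interval, giving two smaller hyper-rectangles."""
    h = max(sub, key=lambda k: sub[k][1] - sub[k][0])
    l, u = sub[h]
    mid = (l + u) // 2
    return {**sub, h: (l, mid)}, {**sub, h: (mid + 1, u)}


def ar_count_satisfying(family, threshold):
    """Count realizations with value <= threshold without enumerating them."""
    queue, count, checks = deque([family]), 0, 0
    while queue:
        sub = queue.popleft()
        lo, hi = bounds(sub)
        checks += 1
        size = 1
        for l, u in sub.values():
            size *= u - l + 1
        if hi <= threshold:    # every member satisfies the property
            count += size
        elif lo > threshold:   # no member satisfies it: prune
            continue
        else:                  # inconclusive bounds: refine the subfamily
            queue.extend(split(sub))
    return count, checks
```

A subfamily is split only when its bounds straddle the threshold, which mirrors the refinement step described above; conclusive subfamilies are accepted or discarded as a whole.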

The implementation of PAYNT is composed of 80 Python modules containing 
7k source lines of code. These metrics consider only our implementation and 
do not include the extensions contributed to Storm and its Python API, invoked 
by PAYNT. All modules adhere to the PEP 8 coding conventions for Python 
code [41,43] and are documented with Sphinx for automatic generation of 
documentation. The specific logic components are tested with unit tests to maintain 
their correct functionality. Regression tests verify the accuracy and correctness 
of the synthesis results. Our tests currently cover more than 90% of the source 
code lines. 


5 Performance Evaluation and Applicability 


Table 1 lists the results of PAYNT on two variants (hard and easy) of five 
different case studies from various domains taken from [11,12]. Further on, we 
demonstrate the applicability of PAYNT and interpret the synthesis results for 
two of these case studies. All experiments are run on an Ubuntu 19.04 machine 
with Intel i5-8300H (4 cores at 2.3GHz) and using up to 8 GB RAM, with all 
the algorithms being executed single-threaded. The artefact for reproducing 
the experiments is available at https://doi.org/10.5281/zenodo.4726056. 
Maze. This synthesis problem can be seen as an instance of POMDP controller 
synthesis. A robot is deployed at a random location inside a known maze, see 
Fig. 5. The robot is only equipped with a simple wall sensor, and cannot distin- 
guish maze cells with identical sets of surrounding walls such as cells 1 and 3, 
and cells 11 through 13. Observation-equivalent cells are indicated by the same 
color in Fig.5. Possible actions are movements in the four cardinal directions. 
Movements are subject to a random error: e.g., upon moving east, with a small 
probability the robot actually moves west. We sketch a robot controller that 
helps it to reach the exit of the maze (cell 12). The controller may use a single 
bit of memory initially having the value 0. The holes in this sketch are taken 
actions (where to steer, how to change the memory bit) based on the current 
observation (detected walls, current memory state). This sketch describes a fam- 
ily of 9.4M candidate programs. Our goal is to find a realization that minimizes 
the expected number of steps to reach the exit. 

Using the inductive synthesis techniques, PAYNT explores the set of 
candidate realizations in an hour (1-by-1 enumeration takes more than one 
day) and synthesizes the controller depicted in Fig. 5. Here arrows represent 
the steering direction based on the current memory value (number 
at the base of an arrow), as well as the corresponding memory update 
(number at the tip of an arrow). For instance, a robot in cell 1 goes west 
if the memory value is 0 and goes east otherwise, without changing the 
memory in either case. A robot at cell 0 always goes east and sets its memory 
bit to 1. The synthesized controller is optimal. If a robot reaches a cell with 
a unique set of enclosing walls (cells 0, 2 and 4), then it knows its precise 
position within the maze and can navigate to the exit. Similarly, navigating 
north from cells 11 or 13 ensures that the robot eventually reaches cells 0 or 4. 
If the robot is deployed in an orange or purple cell, then it 
has to ‘try’ one possible direction in order to recognize its position within the 
maze. For example, a robot deployed at cells 5-10 will first go north (recall that 
the initial memory value is 0), from where it can determine its cell. Note that in 
this observation group it is indeed more beneficial to first explore north since the 
robot is twice as likely to be initially deployed at locations 5/7/8/10, as compared 
to locations 6 and 9. The expected time to reach the exit for this policy is 
9.8 steps. Note that this cannot be improved by adding more memory to the 
controller. 

Fig. 5. The spatial structure of Maze. Cells with identical sets of surrounding walls are 
depicted with similar colors. The arrows depict the synthesized controller. (Color 
figure online) 
Herman. This case study considers a token ring with an odd number of stations 
that are connected by a unidirectional ring. Each station has a Boolean flag, 
observable by itself and by its successor in the ring. A station has a token when 
the two flags it observes are identical. A good configuration is a situation in 
which only one station has a token. All other configurations are faulty. A token 
protocol is self-stabilizing, if the ring gets from a faulty configuration into a good 
configuration. The performance can be measured as stabilization time, i.e., the 
expected number of rounds to reach a good configuration. 

We sketch a variant of Herman’s randomized self-stabilization protocol [6, 28, 
34]. In this protocol, all stations behave the same. The protocol is synchronized, 
and in every round a station without a token flips its flag. Every station that has 
a token must choose whether to pass a token (by setting its flag accordingly). In 
the original protocol this choice is resolved by a single (biased) coin flip. We 
are interested in the synthesis of alternatives. We give each station an additional 
single bit of memory and the choice between 25 different coin biases. The param- 
eters in the sketch are the choice of a coin based on the memory value as well 
as the memory updates. By resolving the choices, we obtain the same protocol 
for each station. The parameter combinations yield a family of 3.1M programs 
and the goal of the synthesizer is to identify the one that minimizes stabilization 
time from an initial configuration (all flags true). For a sketch describing a system 
with 5 stations, PAYNT finds the optimal protocol in around 18 min, while 
the 1-by-1 enumeration takes more than a day. The obtained optimal strategy 
relies on initially using the most fair coins available (bias ~ 0.25) and keeping 
the memory bit at 1. Whenever a process eventually decides to keep the token, 
the memory is reset to 0 and the process starts using highly unfair coins (bias 
0.07), implying that the process is more likely to keep its token for a long 
time until it is eventually passed further. Using this strategy, the system can on 
average stabilize in four rounds. 
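
The token rule of the ring can be made concrete with a short Python sketch. It only encodes the definitions given above (a station holds a token when its own flag equals its predecessor's flag, and a configuration is good when exactly one token exists); the function names `tokens` and `is_good` are hypothetical and the coin-flipping part of the protocol is deliberately left out.

```python
def tokens(flags):
    """Indices of stations holding a token.

    Station i observes its own flag and the flag of its predecessor i - 1;
    index -1 wraps around, which closes the ring.
    """
    return [i for i in range(len(flags)) if flags[i] == flags[i - 1]]


def is_good(flags):
    """A good configuration has exactly one token."""
    return len(tokens(flags)) == 1
```

With all flags true, as in the initial configuration used above, every station holds a token. Because the number of positions at which neighbouring flags differ is always even on a ring, a ring with an odd number of stations always has an odd number of tokens, so at least one token is always present.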
Abstract. In the Adapter Design Pattern, a programmer implements 
a Target interface by constructing an Adapter that accesses existing 
Adaptee code. In this work, we present a reactive synthesis interpretation 
of the adapter design pattern, wherein an algorithm takes Adaptee 
and Target transducers, and the aim is to synthesize an Adapter transducer 
that, when composed with the Adaptee, generates a behavior that 
is equivalent to the behavior of the Target. One use of such an algorithm 
is to synthesize controllers that achieve similar goals on different hard- 
ware platforms. While this problem can be solved with existing synthesis 
algorithms, current state-of-the-art tools fail to scale. To cope with the 
computational complexity of the problem, we introduce a special form of 
specification format, called Separated GR(k), which can be solved with 
a scalable synthesis algorithm but still allows for a large set of realistic 
specifications. We solve the realizability and the synthesis problems for 
Separated GR(k), and show how to exploit the separated nature of our 
specification to construct better algorithms, in terms of time complexity, 
than known algorithms for GR(k) synthesis. We then describe a tool, 
called SGR(k), that we have implemented based on the above approach 
and show, by experimental evaluation, how our tool outperforms current 
state-of-the-art tools on various benchmarks and test-cases. 


1 Introduction 


Inspired by the well known adapter design pattern [18], we study the use of 
reactive synthesis for generating adapters that translate inputs meant for a tar- 
get transducer to inputs of an adaptee transducer. Consider, as one motivating 
example, the practice of adding code to an operating system that mitigates the 
risk posed by newly discovered hardware vulnerabilities like Spectre and Melt- 
down [23,26]. While the discovery of such vulnerabilities puts constraints on how 
the hardware can be used, the patch of the operating system (called adapter in 
this paper) takes it upon itself to run all applications without 
change [25]. It does so by allowing applications to use the existing interface, while 
adapting their operation in a way that ensures that the system is not exposed to 
the new threat. 

Formally, we propose the following synthesis problem: given two finite-state 
transducers called Target and Adaptee, synthesize a finite-state transducer called 
Adapter such that 

Adaptee ∘ Adapter ≃ Target. 


The symbol ∘ stands for standard transducer composition and the symbol ≃ 
stands for an equivalence relation, a generalization of sequential equality, which 
we explain below. In words, we want an Adapter that stands between an Adaptee 
and its inputs and guarantees that the composition Adaptee ∘ Adapter is 
equivalent to Target. In the vulnerability patching example, Adaptee is a model 
of the constrained hardware and Target is a model of the hardware as used before 
the discovery of the vulnerability, without the new constraints. The Adapter that 
we generate models the patch that mediates between the vulnerable hardware 
and applications that are not aware of the vulnerability. 

In our setting, an input to the synthesis algorithm is the equivalence relation 
along with the specification of the adaptee and of the target. While the problem 
of synthesizing an adapter such that Adaptee ∘ Adapter is sequentially equal to 
Target may be useful in some cases [32], we study here a more general problem. 
This is called for by applications such as the vulnerability-covering patches 
described above. Specifically, we allow our users to specify an equivalence 
relation between Adaptee ∘ Adapter and Target that is not necessarily sequential 
equality. In this paper, we propose to use ω-regular properties [20] for specifying 
this equivalence relation, as follows. We assume, without loss of generality, that 
the outputs of both the Target and the Adaptee are assignments to disjoint sets 
of atomic propositions. We then consider sequences of pairs of such assignments 
that correspond to zipped runs of Adaptee ∘ Adapter and of Target over the same 
input. Having this set of sequences in mind, the user specifies a set of temporal 
properties using an ω-regular formalism such as LTL or Büchi automata. The 
transducer Adaptee ∘ Adapter is considered equivalent to Target if all the 
properties that the user specified are satisfied for each sequence in the set [19]. Note 
that the equivalence relation can be very different from sequential equality. It 
can, for example, say that Adaptee ∘ Adapter must be, in a way, a “mirror image” 
of Target, as demonstrated by the cleaning robots example in Sect. 4.1, where 
Target is a robot that cleans some rooms and Adaptee ∘ Adapter is a robot that 
cleans all the rooms that Target did not clean. 

The solution that we propose in this paper consists of two phases: we first 
transform the transducers to transition systems and arrive at a game structure 
that is more amenable for game-based techniques. Then we make use of the spe- 
cific form of the resulting game and some simplifying assumptions about the form 
of the equivalence properties to solve the game efficiently. The game structures 
that we analyze consist of pairs of transition systems called Input and Output, 
accompanied by a set of ω-regular properties that specify an equivalence relation 
between the two, as described above. The game that we solve is, then, to find a 
controller that reads the assignments to the variables of the Input and produces 
a valid sequence of assignments to the variables of the Output such that all the 
properties are satisfied. The translation of the transducers to this game structure 
is rather direct, as elaborated in Sect. 4. The Input transition system is generated 
from the Target transducer and the Output transition system is generated 
from the Adaptee transducer. This is because we want the Adapter, which we 
generate from the controller as described below, to consider the behavior of the 
Target and to translate it to a command that generates an equivalent behaviour 
of Adaptee. Once we find a controller that solves the game, we can transform it 
to an Adapter as we detail in Sect. 4. 

The synthesis problem that we defined so far is as hard computationally as 
general LTL synthesis and is thus double exponential in the worst case [37]. To 
cope with this difficulty, we propose to use a well known fragment of LTL called 
GR(k). GR(k) generalizes the GR(1) subset of LTL [9], a practical fragment 
of LTL for which a feasible reactive synthesis algorithm exists (see, e.g., [8, 28, 
33]). Furthermore, GR(k) formulas are known to be highly expressive, as they 
can encode most commonly appearing LTL industrial patterns [15,29,30] and 
DBA properties (see related works for details). In addition to using GR(k), 
since the Input and Output in our model are separated transition systems, with 
separated sets of atomic propositions, we focus on properties that separate input 
and output variables. That is, our specification has the form $\bigwedge_{i=1}^{k} (\varphi_i \rightarrow \psi_i)$, 
where the $\varphi_i$ and $\psi_i$ are conjunctions of LTL GF (globally in the future) formulas 
over Input variables only and Output variables only, respectively. We call this 
model Separated GR(k). We show through several case studies that this fragment 
of LTL suffices to specify a range of useful equivalence relations. 

We study the problems of realizability and synthesis for Separated GR(k) 
games. For that, we first consider the sub-problem of solving a weak Büchi game. 
Then we identify and make use of a property of separated games that we call the 
delay property: the system can delay its response to the environment indefinitely 
as long as it remains in the same connected component of the game graph. 
This allows us to decide the realizability of Separated GR(k) in $O(|\varphi| + N)$ 
symbolic operations, and to synthesize a controller for a realizable specification 
in $O(|\varphi| N)$ symbolic operations, where $\varphi$ is the Separated GR(k) specification 
and $N$ is the size of the state space. Thus, Separated GR(k) games are easier to 
solve than GR(k) games, which require $O(N^{k+1} k!)$ operations [35]. This 
demonstrates the efficiency of our framework, since $|\varphi|$ tends to be smaller than 
$N$ and, in most practical cases, $|\varphi| \in O(\log N)$. 

The benefits of the complexity-theoretic improvement are reflected in empiri- 
cal evaluations on our case studies of separated GR(k) formulas. We demonstrate 
that while separated GR(k) formulas are challenging for state-of-the-art synthe- 
sis tools, a symbolic BDD-based implementation of our algorithm solves them 
scalably and efficiently. 

The rest of the paper is organized as follows: Sect. 2 introduces necessary 
preliminaries. Separated GR(k) games are introduced and formulated in Sect. 3. 
In Sect. 4 we describe how to use Separated GR(k) game synthesis to generate 
the adapter transducer, and introduce several use cases. Next, we turn to solving 
Separated GR(k) games. An overview of our solution approach and a necessary 
property for the correctness of the algorithm, called the delay property, is given in Sect. 5. 
A complete symbolic algorithm is presented in Sect. 6. An empirical evaluation 
on case studies is presented in Sect. 7. Finally, in Sects. 8 and 9, respectively, we 
discuss related work and conclude. Detailed proofs appear in the full version of the 
paper [3]. 


2 Preliminaries 


General Definitions. Given a set of Boolean variables $V$, a state over $V$ is 
an assignment $s$ to the variables in $V$. We describe $s$ as the subset of $V$ that 
is assigned True in $s$. The set of primed variables of $V$ is $V' = \{v' \mid v \in V\}$. 
Then $s' = \{v' \mid v \in s\}$ is the primed state over $V'$. An assertion over $V$ is 
a Boolean formula over the variables $V$. A state $s$ satisfies an assertion $\rho$ over the 
same variables, denoted $s \models \rho$, if $\rho$ evaluates to True by assigning True to the 
elements of $s$ and False to all other variables. We define the projection of a state 
$s$ on a subset $U \subseteq V$ as $s|_U = s \cap U$. We extend the notion of projection to a 
set of states $S \subseteq 2^V$ by defining $S|_U = \{s|_U \mid s \in S\}$. 
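
These set-based definitions translate directly into code. The following Python sketch represents a state as a frozenset of the variables assigned True; the helper names `prime`, `project`, `project_all` and `satisfies`, and the choice of representing an assertion as a Python callable, are assumptions made for illustration only.

```python
def prime(state):
    """Primed copy of a state: every variable v becomes v'."""
    return frozenset(v + "'" for v in state)


def project(state, subset):
    """Projection of a state on U: the intersection of s and U."""
    return frozenset(state) & frozenset(subset)


def project_all(states, subset):
    """Projection extended to a set of states."""
    return {project(s, subset) for s in states}


def satisfies(state, assertion):
    """s satisfies p: evaluate p with exactly the variables in s set to True.

    The assertion is given as a function that receives a valuation
    (variable name -> bool) and returns a bool.
    """
    return assertion(lambda v: v in state)
```

For instance, with `s = frozenset({"a", "c"})`, the assertion "a and not b" can be written as `lambda val: val("a") and not val("b")`, and `project(s, {"a", "b"})` keeps only the variable `a`.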

Our specification is a special form of Linear Temporal Logic (LTL). LTL [36] 
extends propositional logic with infinite-horizon temporal operators. The syntax 
of an LTL formula over a finite set of Boolean variables V is defined as follows: 
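$\varphi ::= v \in V \mid \neg\varphi \mid \varphi \wedge \varphi \mid \varphi \vee \varphi \mid X\varphi \mid \varphi\, U\, \varphi \mid F\varphi \mid G\varphi$. Here X (Next), U (Until), 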
F (Eventually), G (Always) are temporal operators. The semantics of LTL can 
be found in [5, Chapter 5]. 

We model the adapters as transducers. A transducer is a deterministic finite- 
state machine with no accepting states, but with additional output alphabet and 
an additional function from the set of states to the output alphabet. A formal 
definition of a transducer is not required for this paper. 

The algorithms developed in this paper are symbolic, i.e., they manipulate implicit 
representations of sets of states. To this end, we use Binary Decision Diagrams 
(BDDs) [10] to represent assertions. For a BDD $B$ and sets of variables $V_1, \ldots, V_n$, 
we write $B(V_1, \ldots, V_n)$ to denote that $B$ represents an assertion over $V_1 \cup \cdots \cup V_n$. 
For a state $s$ over $V$, we write $s \models B(V)$ to denote that the assertion that $B$ 
represents is satisfied by the state $s$. BDDs support several symbolic operations: 
conjunction ($\wedge$), disjunction ($\vee$), negation ($\neg$), and quantification over variables using 
the $\exists$ and $\forall$ operators. We measure the time complexity of a symbolic algorithm 
by the worst-case number of symbolic operations it performs. A rigorous 
treatment of BDD operations can be found in the paper’s full version [3]. 


Game Structures and Games. We follow the notations of [9]. A game structure 
$GS = (\mathcal{I}, \mathcal{O}, \theta_\mathcal{I}, \theta_\mathcal{O}, \rho_\mathcal{I}, \rho_\mathcal{O})$ defines a turn-based interaction between an 
environment player and a system player. The input variables $\mathcal{I}$ and output variables 
$\mathcal{O}$ are two disjoint sets of Boolean variables that are controlled by the environment 
and the system, respectively. The environment’s initial assumption $\theta_\mathcal{I}$ is an 
assertion over $\mathcal{I}$, and the system’s initial guarantee $\theta_\mathcal{O}$ is an assertion over $\mathcal{I} \cup \mathcal{O}$. 
The environment’s safety assumption $\rho_\mathcal{I}$ is an assertion over $\mathcal{I} \cup \mathcal{O} \cup \mathcal{I}'$, where 
the interpretation of $(i_0, o_0, i_1') \models \rho_\mathcal{I}$ is that from state $(i_0, o_0)$ the environment 
can assign $i_1$ to the input variables. W.l.o.g., we assume that $\rho_\mathcal{I}$ is deadlock free, 
i.e., for all $(i_0, o_0)$ there exists an $i_1$ s.t. $(i_0, o_0, i_1') \models \rho_\mathcal{I}$. Similarly, the system’s 
safety guarantee $\rho_\mathcal{O}$ is an assertion over $\mathcal{I} \cup \mathcal{O} \cup \mathcal{I}' \cup \mathcal{O}'$, where the interpretation 
of $(i_0, o_0, i_1', o_1') \models \rho_\mathcal{O}$ is that from state $(i_0, o_0)$, when the environment assigns $i_1$ 
to the input variables, the system can assign $o_1$ to the output variables. Again, 
w.l.o.g., we assume that $\rho_\mathcal{O}$ is deadlock free, i.e., for all $(i_0, o_0, i_1)$ there exists an 
$o_1$ s.t. $(i_0, o_0, i_1', o_1') \models \rho_\mathcal{O}$. 

A play over $GS$ progresses by the players taking turns to assign values to their 
own variables ad infinitum, where the players must satisfy the initial conditions 
at the start and the safety conditions thereafter. Formally, a play $\pi = s_0, s_1, \ldots$ is 
an infinite sequence of states over $\mathcal{I} \cup \mathcal{O}$ such that $s_0 \models \theta_\mathcal{I} \wedge \theta_\mathcal{O}$ and 
$(s_j, s_{j+1}') \models \rho_\mathcal{I} \wedge \rho_\mathcal{O}$ for all $j \geq 0$. A play prefix is either a play or a finite sequence of states that 
can be extended to a play. Then a strategy is a function $f : (2^{\mathcal{I} \cup \mathcal{O}})^+ \times 2^\mathcal{I} \to 2^\mathcal{O}$ 
such that if $s_0, \ldots, s_m$ is a play prefix, $(s_m, i') \models \rho_\mathcal{I}$ and $f(s_0, \ldots, s_m, i) = o$, 
then $(s_m, i', o') \models \rho_\mathcal{O}$. Intuitively, a strategy directs the system on what to 
assign to the output variables, depending on the history of a play and the most 
recent assignment by the environment to the input variables. A play prefix is 
said to be consistent with a strategy $f$ if for all states $s_j = (i_j, o_j)$ in that 
prefix with $j > 0$, $f(s_0, \ldots, s_{j-1}, i_j) = o_j$. A strategy is memoryless if it only 
depends on the last state and the most recent assignment to the input variables. 
Formally, a memoryless strategy is a function $f : 2^{\mathcal{I} \cup \mathcal{O}} \times 2^\mathcal{I} \to 2^\mathcal{O}$ such that if 
$(s_m, i') \models \rho_\mathcal{I}$ and $f(s_m, i) = o$, then $(s_m, i', o') \models \rho_\mathcal{O}$. 
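
The consistency condition for a memoryless strategy can be checked mechanically on a finite play prefix. The sketch below is an illustration only: states are modelled as pairs `(i, o)`, the strategy as a Python function of the previous state and the current input, and the name `consistent` is hypothetical.

```python
def consistent(prefix, strategy):
    """Check that a play prefix follows a memoryless strategy.

    prefix   -- list of states, each a pair (i, o)
    strategy -- function (previous_state, current_input) -> output
    The first state is fixed by the initial conditions, so only the
    states with index j > 0 are compared against the strategy.
    """
    return all(
        strategy(prefix[j - 1], prefix[j][0]) == prefix[j][1]
        for j in range(1, len(prefix))
    )
```

For example, the strategy that copies the latest input to the output, `lambda s, i: i`, is consistent with the prefix `[(0, 1), (1, 1), (0, 0)]` but not with `[(0, 0), (1, 0)]`.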

A game is a tuple $(GS, \varphi)$ where $GS$ is a game structure over inputs $\mathcal{I}$ and 
outputs $\mathcal{O}$ and $\varphi$ is an LTL formula over $\mathcal{I} \cup \mathcal{O}$ called a winning condition. A 
play $\pi$ is winning for the system if $\pi \models \varphi$. A strategy $f$ wins from state $s$ if every 
play $\pi$ from $s$ that is consistent with $f$ is winning for the system. A strategy 
$f$ wins from $S$, where $S$ is an assertion over $\mathcal{I} \cup \mathcal{O}$, if it wins from every state 
$s \models S$. The winning region of the system is the set of states from which it has a 
winning strategy. A strategy $f$ is winning if for every state $i \models \theta_\mathcal{I}$ there exists 
a state $o \in 2^\mathcal{O}$ such that $(i, o) \models \theta_\mathcal{O}$ and $f$ wins from $(i, o)$. In this paper, we 
consider games with the following winning conditions. 


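— Reachability games: $F(\varphi)$ where $\varphi$ is an assertion over $\mathcal{I} \cup \mathcal{O}$. 
— Safety games: $G(\varphi)$ where $\varphi$ is an assertion over $\mathcal{I} \cup \mathcal{O}$. 
— Büchi games: $GF(\varphi)$ where $\varphi$ is an assertion over $\mathcal{I} \cup \mathcal{O}$. 
— GR(k) games: $\bigwedge_{l=1}^{k} \left( \bigwedge_{i=1}^{n_l} GF(\varphi_{l,i}) \rightarrow \bigwedge_{j=1}^{m_l} GF(\psi_{l,j}) \right)$ where all $\varphi_{l,i}$ and $\psi_{l,j}$ 
are assertions over $\mathcal{I} \cup \mathcal{O}$. 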


Given a game $(GS, \varphi)$, realizability is the problem of deciding whether a winning 
strategy for the system exists, and synthesis is the problem of constructing 
a winning strategy if one exists. We note that a realizability check can be reduced 
to the identification of the winning region $W$: a winning strategy exists iff for 
all $i \models \theta_\mathcal{I}$ there exists $o \in 2^\mathcal{O}$ such that $(i, o) \models \theta_\mathcal{O}$ and $(i, o) \in W$. Hence, the 
synthesis problem can be solved by constructing a strategy that wins from $W$. 
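
As an illustration of the reduction from realizability to the winning region, the following Python sketch solves a small safety game explicitly. It is not the symbolic BDD-based procedure of this paper: states are enumerated as pairs `(i, o)`, the safety conditions are passed as the hypothetical callbacks `env_moves` and `sys_moves`, and the winning region is computed by the textbook greatest-fixpoint iteration.

```python
def safety_winning_region(states, safe, env_moves, sys_moves):
    """Winning region of the system for the safety condition G(safe).

    states    -- iterable of states (i, o)
    safe      -- predicate on states
    env_moves -- env_moves(s): inputs the environment may choose next
    sys_moves -- sys_moves(s, i2): outputs the system may answer with
    """
    win = {s for s in states if safe(s)}
    changed = True
    while changed:
        changed = False
        for s in list(win):
            # s stays winning only if every environment move has an
            # answer that leads back into the current candidate region.
            ok = all(
                any((i2, o2) in win for o2 in sys_moves(s, i2))
                for i2 in env_moves(s)
            )
            if not ok:
                win.remove(s)
                changed = True
    return win


def realizable(win, init_inputs, init_outputs):
    """Every initial input needs an initial output inside the winning region."""
    return all(
        any((i, o) in win for o in init_outputs(i)) for i in init_inputs
    )
```

With one Boolean input, one Boolean output and the safety assertion "output equals input", the game is realizable when the system may set its output freely, and unrealizable when the system is forced to keep its previous output.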
Game Graphs and Weak Büchi Games. The game graph for a game 
structure $GS$ is the directed graph $(V, E)$ with vertices $V = 2^{\mathcal{I} \cup \mathcal{O}}$ and edges 
$E = \{(s, t) \mid (s, t') \models \rho_\mathcal{I} \wedge \rho_\mathcal{O}\}$. Intuitively, vertices are states over $\mathcal{I}$ and $\mathcal{O}$, 
and edges represent valid transitions between states according to the safety 
conditions. The game graph can be useful for analyzing the structural properties of 
a game structure via graph-theoretical properties. 

A finite path in a directed graph $(V, E)$ is a sequence $v_0, \ldots, v_n \in V^+$ such 
that $(v_j, v_{j+1}) \in E$ for all $0 \leq j < n$. An infinite path $v_0, v_1, \ldots \in V^\omega$ is similarly 
defined. A vertex u is said to be reachable from another vertex v if there is a 
finite path from v to u. A strongly connected component (SCC) of a directed 
graph (V, E) is a maximal set of vertices within which every vertex is reachable 
from every other vertex. It is well known that SCCs partition the set of vertices 
of a directed graph, and that the set of SCCs is partially ordered with respect 
to reachability. Also note that every infinite path ultimately stays in an SCC. 

Let $(GS, GF\varphi)$ be a game with a Büchi winning condition, and let $S_0, \ldots, S_m$ 
be the set of SCCs that partition the game graph of $GS$. We say that $(GS, GF\varphi)$ 
is a weak Büchi game if, given the set $F$ of states that satisfy the assertion $\varphi$, 
for every SCC $S_i$, either $S_i \subseteq F$ or $S_i \cap F = \emptyset$. Thus, the SCCs of a weak Büchi 
game are either accepting components, meaning all of their states are contained in 
$F$, or non-accepting components, meaning none of their states is in $F$. As a 
consequence, a play in a weak Büchi game is winning for the system if the play 
ultimately never exits an accepting component. Similarly, a strategy is winning 
for the system if it can guarantee that every play will ultimately remain inside 
an accepting component. 
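
The weakness condition can be checked directly on an explicit graph. The following Python sketch computes SCCs through mutual reachability, which is simple rather than efficient, and then tests whether every SCC lies entirely inside or entirely outside the accepting set. The function names and the adjacency-dictionary encoding are assumptions for illustration; the algorithms in this paper operate symbolically instead.

```python
def reachable(edges, v):
    """All vertices reachable from v (including v itself)."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for w in edges.get(u, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def sccs(vertices, edges):
    """Strongly connected components as sets of mutually reachable vertices."""
    reach = {v: reachable(edges, v) for v in vertices}
    comps, assigned = [], set()
    for v in vertices:
        if v not in assigned:
            comp = {u for u in reach[v] if v in reach[u]}
            comps.append(comp)
            assigned |= comp
    return comps


def is_weak_buchi(vertices, edges, accepting):
    """Every SCC is either contained in or disjoint from the accepting set."""
    return all(
        c <= accepting or not (c & accepting) for c in sccs(vertices, edges)
    )
```

On the graph with edges 0 → 1, 1 → 0, 1 → 2, 2 → 3 and 3 → 2, the SCCs are {0, 1} and {2, 3}, so the accepting set {2, 3} yields a weak Büchi game while {1, 2, 3} does not.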


3 Separated GR(k) Games 


Our framework relies on the core idea of reducing the problem of adapter generation 
to solving a Separated GR(k) game, which we define in this section. At a 
high level, a Separated GR(k) game differs from a regular GR(k) game in the separation 
between input and output variables in both the game structure and the winning 
condition. We show in later sections that the separation of variables leads to algorithmic 
benefits for the synthesis problem. Formally, we have the following. 


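Definition 1. A game structure $GS = (\mathcal{I}, \mathcal{O}, \theta_\mathcal{I}, \theta_\mathcal{O}, \rho_\mathcal{I}, \rho_\mathcal{O})$ separates variables 
over input variables $\mathcal{I}$ and output variables $\mathcal{O}$ if: 


— The environment’s initial assumption $\theta_\mathcal{I}$ is an assertion over $\mathcal{I}$ only. 
— The system’s initial guarantee $\theta_\mathcal{O}$ is an assertion over $\mathcal{O}$ only. 
— The environment’s safety assumption $\rho_\mathcal{I}$ is an assertion over $\mathcal{I} \cup \mathcal{I}'$ only. 
— The system’s safety guarantee $\rho_\mathcal{O}$ is an assertion over $\mathcal{O} \cup \mathcal{O}'$ only. 


The interpretation of a game structure which separates variables is that the 
underlying game graph $(V, E)$ is the product of two distinct directed graphs over 
disjoint sets of variables: $G_\mathcal{I}$ over the variables $\mathcal{I} \cup \mathcal{I}'$, and $G_\mathcal{O}$ over the variables 
$\mathcal{O} \cup \mathcal{O}'$. For $\mathcal{J} \in \{\mathcal{I}, \mathcal{O}\}$, the vertices of $G_\mathcal{J}$ correspond to states over $\mathcal{J}$ and 
there is an edge between states $s$ and $t$ if $(s, t') \models \rho_\mathcal{J}$. 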
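
The product interpretation can be illustrated with a short Python sketch that builds the edges of the game graph from two independent edge sets. The encoding of each graph as a set of `(source, target)` pairs and the name `product_graph` are assumptions for illustration.

```python
from itertools import product


def product_graph(in_edges, out_edges):
    """Edges of the game graph as the product of the input and output graphs.

    A game state is a pair (i, o); it can move to (i2, o2) exactly when the
    input graph has the edge (i, i2) and the output graph has the edge (o, o2).
    """
    return {
        ((i, o), (i2, o2))
        for (i, i2), (o, o2) in product(in_edges, out_edges)
    }
```

Since each player's edge set is consulted independently, the number of product edges is the product of the two edge counts, which reflects that neither player's move restricts the other's.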
Next, the notion of separation of variables extends to games with GR(k) 
winning conditions as follows: 


Definition 2. A GR(k) winning condition $\varphi$ over $\mathcal{I} \cup \mathcal{O}$ separates variables 
w.r.t. $\mathcal{I}$ and $\mathcal{O}$ if $\varphi = \bigwedge_{l=1}^{k} \left( \bigwedge_{i=1}^{n_l} GF(\varphi_{l,i}) \rightarrow \bigwedge_{j=1}^{m_l} GF(\psi_{l,j}) \right)$ such that each $\varphi_{l,i}$ 
is an assertion over $\mathcal{I}$ and each $\psi_{l,j}$ is an assertion over $\mathcal{O}$. 


A Separated GR(k) game is a GR(k) game $(GS, \varphi)$ over $\mathcal{I} \cup \mathcal{O}$ in which both 
$GS$ and $\varphi$ separate variables w.r.t. $\mathcal{I}$ and $\mathcal{O}$. 
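
To see what a separated GR(k) condition demands of a single play, it helps to evaluate it on an ultimately periodic play, where GF(p) holds exactly when p holds somewhere in the repeated loop. The Python sketch below does this under illustrative assumptions: states are pairs `(i, o)`, assertions are Python predicates, and the names `gf_holds` and `separated_grk_holds` are hypothetical.

```python
def gf_holds(loop, assertion):
    """GF(p) on a play that repeats `loop` forever: p must hold in the loop."""
    return any(assertion(s) for s in loop)


def separated_grk_holds(loop, conjuncts):
    """Evaluate a separated GR(k) condition on the repeated part of a play.

    conjuncts -- list of (input_assertions, output_assertions) pairs; input
                 assertions read only i and output assertions read only o,
                 mirroring the separation of variables.
    """
    for ins, outs in conjuncts:
        premise = all(gf_holds(loop, lambda s, p=p: p(s[0])) for p in ins)
        conclusion = all(gf_holds(loop, lambda s, q=q: q(s[1])) for q in outs)
        if premise and not conclusion:
            return False
    return True
```

Each conjunct only constrains the play when all of its input-side obligations recur infinitely often, in which case all of its output-side obligations must recur as well.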

A major observation is that in a game played over a separated game structure, 
the actions of the two players are independent: the environment’s actions do 
not limit the system’s actions, and vice versa. In later sections we see how this 
observation leads to algorithmic improvements in solving Separated GR(k) games 
over a regular GR(k) game. Specifically, in Sect. 4 we see how to use Separated 
GR(k) games to generate the adapter transducer. In Sects. 5 and 6 we discuss 
algorithms for realizability and synthesis of Separated GR(k) games. 


4 From Transducers to Separated GR(k) 


We describe, using an end-to-end example, how adapter transducer generation 
can be reduced to synthesis of Separated GR(k) games. 

We begin with user-provided Target and Adaptee transducers. These transducers 
model the behavior of a system that we want to use (Adaptee) and the 
behavior of a system that we want to emulate (Target). For example, the transducers 
in Fig. 1 formulate the following scenario. (1) Target is a hardware 
interface that we want to support, such that the U (up) and the D (down) commands 
send the hardware from mode $s_0$ to modes $s_1$ and $s_2$, respectively, from 
which the S (stay) command keeps the system looping at the chosen mode. (2) 
Adaptee is hardware that we can use that also has three modes, but which 
does not allow the command S after U. Instead, it allows a D command that 
switches the mode back to $s_0$. 


Fig. 1. An example of Target and Adaptee transducers. In this example, the $t_i$ and $a_i$ 
variables encode the binary representation of the mode being moved to. 
The second step is a formulation of the equivalence relation, where we define 
the type of emulation that we require. In our example we want to maintain the 
following property: if Target visits a mode $s_i$ infinitely often for a certain input 
sequence, then so does Adaptee ∘ Adapter. This can be expressed in LTL as: 

$$\bigwedge_{i=0}^{2} \left( GF(bin_t(s_i)) \rightarrow GF(bin_a(s_i)) \right)$$

where $bin_t(s_i)$ denotes the binary representation of mode $s_i$ using variables $t_1, t_0$, 
and similarly for $bin_a(s_i)$ using variables $a_1, a_0$. Note that in this example we 
cannot just synthesize an adapter that cycles through all modes in Adaptee ∘ 
Adapter infinitely often, since the Adaptee transducer does not allow that. 

As a third step, to generate a separated GR(k) game, we translate the Target 
and Adaptee transducers to Input and Output transition systems as depicted, for 
example, in Fig. 2. Since Adaptee and Target are two separate transducers, each 
with its own structure, it is natural to model these as two separate transition 
systems on distinct variables. Thus, the transition systems are produced by the 
well known projection construction that turns an FST into an FSA that accepts 
the output language of the transducers [32]. Note that in our setting Target is 
translated to Input and Adaptee is translated to Output. This may appear as a 
role inversion to readers. We propose it because the role of the controller in our 
setting is to translate the behavior of Target to an equivalent behavior of the 
Adaptee. 
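
The projection step can be sketched in Python for a transducer whose output is determined by its state. The encoding below is an assumption made for illustration: `delta` maps a (state, input) pair to the next state, `out` maps each state to its output, and the example transitions follow only the textual description of Target given earlier, with mode names standing in for the binary output encoding.

```python
def project_to_outputs(delta, out):
    """Drop input labels: keep only which output may follow which output."""
    return {(out[q], out[q2]) for (q, _inp), q2 in delta.items()}


# Hypothetical encoding of Target as described in the text: U and D leave
# mode s0, and S keeps looping at the mode that was chosen.
target_delta = {
    ("s0", "U"): "s1",
    ("s0", "D"): "s2",
    ("s1", "S"): "s1",
    ("s2", "S"): "s2",
}
target_out = {"s0": "s0", "s1": "s1", "s2": "s2"}

input_system = project_to_outputs(target_delta, target_out)
```

The resulting edge set describes the Input transition system of the game: it records the possible sequences of Target outputs while forgetting the commands that caused them.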


Fig. 2. A direct translation of the Target transducer to an Input transition system and 
of the Adaptee transducer to an Output transition system. 


These separate transition systems, together with the specification described 
above, form a Separated GR(k) game that, as a fourth step, we can feed to the 
Separated GR(k) synthesis algorithm. The output of the algorithm is a transducer 
called Controller that maps runs of Input to runs of Output, as shown for our 
example in Fig. 3. This, in effect, connects the output of the Target to the output 
of the Adaptee. 

As a final step, from the controller we can construct the Adapter using the
formula $\text{Adapter} = \text{Adaptee}^{-1} \circ \text{Controller} \circ \text{Target}$. This means that Adapter
contains an internal model of the Target and of the Adaptee. These internal
models are used to translate inputs to expected outputs of the adapter, then
feed them to the controller, and then feed the output of the controller to the
reverse of Adaptee to generate an input to Adaptee that emulates the behaviour
of Target. Note that it is possible to invert transducers symbolically [21].


[Figure: the Controller transducer. Each edge is labelled with a valuation of
$t_1, t_0$ followed, after a bar, by a valuation of $a_1, a_0$.]

Fig. 3. A controller that reads runs of the Input transition system and generates runs
of the Output transition system such that the specified Separated GR(2) formula is
guaranteed to be true.


4.1 Additional Usages of Our Technique 


We give two more examples to demonstrate uses of Separated GR(k). 


Cleaning Robots. This example demonstrates how one can use our technique 
to fulfill tasks that have not been covered by an execution of an existing trans- 
ducer. Consider a cleaning robot (the Target transducer) that moves along a 
corridor-shaped house, from room 1 to room n. The robot follows some plan 
and accordingly cleans some of the rooms. Our goal is to synthesize a con- 
troller that activates a second cleaning robot (the Adaptee transducer) that 
follows the first robot and cleans exactly those rooms left uncleaned. Each 
robot controls a set of variables indicating which room they are in and which 
rooms they have cleaned, and additionally the original robot controls a vari- 
able indicating whether it is done with its cleaning. Our controller is required 
to fulfill requirements of the form: $GF(done) \wedge GF(\neg in{:}clean_i) \rightarrow GF(out{:}clean_i)$ and
$GF(done) \wedge GF(in{:}clean_i) \rightarrow GF(\neg out{:}clean_i)$.


Railway Signalling. This example demonstrates how one can use our tech- 
nique to improve the quality of an existing transducer. We consider a junction 
of n railways, each equipped with a signal that can be turned on (light in green) 
or off (light in red). Some railways overlap and thus their signals cannot be 
turned on simultaneously. We consider an overlapping pattern where railways 
1-4 overlap, and similarly 3-6, 5-8, and so on. 

An existing system (the Target transducer) was programmed to be strictly 
safe in order to avoid accidents, so it never raises two signals simultaneously. 
We want to improve the system’s performance by synthesizing a controller that 
reads the assignments that the existing transducer produces and accordingly
assigns values to the signals in such a way as to produce both safe and maximal
valuations: the ith signal is turned on if and only if the signal of every rail
that overlaps with the ith rail is off. Furthermore, we want to maintain liveness 
properties of the Target system: (1) every signal that is turned on infinitely often 
by the existing system must be turned on infinitely often by the new system as 
well, and (2) if a signal is turned on at least once every m steps (where m is a 
parameter of the specification) by the existing system, then the same holds for 
the new system. 

Note that, in terms of the GR(k) formula, this example is similar to the 
“hardware” example that we gave; we want to emulate the Target’s execution. 
The crux of the example lies in its Adaptee. Here, unlike in the explanatory 
example, the Adaptee is not a given hardware, but rather a virtual component 
that the user introduced to improve the Target performance. In this case the 
Adaptee produces safe and maximal signals. 


5 Overview for Solving Separated GR(k) Games 


The adapter generation framework described in Sect. 4 relies on synthesizing
a controller from a separated GR(k) game. In this section and the next, we
describe how to solve separated GR(k) games. This section gives an overview of
the algorithm in Sect. 5.1 and describes a necessary property, called the delay
property, in Sect. 5.2. The delay property is necessary to prove correctness of our
synthesis algorithm. Later, Sect. 6 gives the complete algorithm and proves its
correctness.


5.1 Algorithm Overview and Intuition 


Following Sect. 3, we are given a Separated GR(k) game that consists of a game
structure $GS$ and a winning condition in GR(k) form $\varphi = \bigwedge_{l=1}^{k} \varphi_l$, where
$\varphi_l = \bigwedge_{j=1}^{n_l} GF(a_{l,j}) \rightarrow \bigwedge_{j=1}^{m_l} GF(g_{l,j})$. Let $G$ be the game graph of $GS$. Consider
an infinite play $\pi$ in $GS$. Like every infinite path on a finite graph, $\pi$ eventually
stabilizes in an SCC $S$. Due to separation of variables, the game graph $G$ can
be decomposed into an input graph $G_{\mathcal{I}}$ and an output graph $G_{\mathcal{O}}$. Then the
projection of $S$ on the inputs is an SCC $S_{\mathcal{I}}$ in $G_{\mathcal{I}}$, and the projection of $S$ on
the outputs is an SCC $S_{\mathcal{O}}$ in $G_{\mathcal{O}}$. The input side of $\pi$ converges to $S_{\mathcal{I}}$ whereas
the output side of $\pi$ converges to $S_{\mathcal{O}}$.

Now, let $S$ be an SCC with projections $S_{\mathcal{I}}$ on $G_{\mathcal{I}}$ and $S_{\mathcal{O}}$ on $G_{\mathcal{O}}$. Then
we call $S$ accepting if for every constraint $\varphi_l$, where $l \in \{1, \ldots, k\}$, one of the
following holds:


All guarantees hold in S. For every $j \in \{1, \ldots, m_l\}$, there exists $o \in S_{\mathcal{O}}$ such
that $o \models g_{l,j}$.

Some assumption cannot hold in S. There exists $j \in \{1, \ldots, n_l\}$ such that
for all $i \in S_{\mathcal{I}}$, $i \not\models a_{l,j}$.
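
The acceptance test above can be sketched in a few lines of Python. This is an explicit-state illustration under our own assumptions: states are sets of the variables that are true, assertions are Python predicates, and the name `is_accepting` is hypothetical.

```python
def is_accepting(s_in, s_out, constraints):
    """s_in, s_out: the input and output projections of an SCC.
    constraints: list of (assumptions, guarantees) pairs, one per
    conjunct of the GR(k) formula; each entry is a list of predicates
    over a single input state or output state, respectively."""
    for assumptions, guarantees in constraints:
        # Every guarantee is satisfied by some output state of the SCC.
        all_guarantees = all(any(g(o) for o in s_out) for g in guarantees)
        # Some assumption is satisfied by no input state of the SCC.
        some_assumption_fails = any(
            not any(a(i) for i in s_in) for a in assumptions)
        if not (all_guarantees or some_assumption_fails):
            return False
    return True


# Hypothetical single constraint GF(t) -> GF(a).
t = lambda i: "t" in i
a = lambda o: "a" in o
print(is_accepting([{"t"}], [{"a"}, set()], [([t], [a])]))  # True
print(is_accepting([{"t"}], [set()], [([t], [a])]))         # False
```

Note that the test looks only at the two projections, never at how input and output states are paired, which is the feature the delay property exploits later.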


Then from the definition of an accepting SCC we have the following: a strategy
that makes sure that every play converges to an accepting SCC, in which
all the relevant guarantee states are visited, is a winning strategy for the system
in $(GS, \varphi)$. To synthesize such a strategy, we do the following: (i) synthesize a
strategy $f_B$ for which every play converges to an accepting SCC; (ii) synthesize
a strategy $f_{travel}$ that travels within every accepting SCC, satisfying as many of
the $g_{l,j}$ guarantees as possible; (iii) construct an overall winning strategy $f$ that
works as follows: the system plays $f_B$ until reaching an accepting SCC $S$, then
the system switches to $f_{travel}$ to satisfy as many of the $g_{l,j}$ guarantees in $S$ as
possible; if the environment moves the play to a non-accepting SCC, the system
can start playing $f_B$ again to reach a different accepting SCC.

The strategy $f_B$ can be found by solving the weak Büchi game
$(GS, GF(acc))$, where $acc$ is the assertion that accepts exactly those states that
belong to accepting SCCs (note that $(GS, GF(acc))$ is a well defined weak Büchi
game). The strategy $f_{travel}$ can be constructed by simply finding a path in $S_{\mathcal{O}}$ that satisfies
the maximum number of guarantees.

A complication arises, however, when switching between $f_{travel}$ and $f_B$, since
it is conceivable that while the system is following $f_{travel}$, the environment could
move to a different SCC that is outside of the winning region of $f_B$. Thus, it
is not clear that we can combine these strategies to make an overall winning
strategy for the system. To show that we can indeed combine both strategies,
we need the following property that we call the delay property: if $(i_1, o_1)$ is a
state in the winning region of $f_B$, and $(i_2, o_2)$ is a state for which there is a
path in $G_{\mathcal{I}}$ from $i_1$ to $i_2$ and a path in $G_{\mathcal{O}}$ from $o_2$ to $o_1$, then $(i_2, o_2)$ is also
in the winning region of $f_B$. We formally state and prove the delay property in
Sect. 5.2. In Sect. 6 we give details of the construction of $f_B$, $f_{travel}$ and the use
of the delay property to prove correctness of the overall winning strategy $f$.


5.2 The Delay Property 


The delay property essentially says that if an SCC $S$ is contained in the winning
region, and the environment moves from $S$ unilaterally to a different SCC $S'$,
then $S'$ is also in the winning region of the system. In this section, we prove that
the Büchi game $(GS, GF(acc))$ where $GS = (\mathcal{I}, \mathcal{O}, \theta_{\mathcal{I}}, \theta_{\mathcal{O}}, \rho_{\mathcal{I}}, \rho_{\mathcal{O}})$, as defined in
Sect. 5.1, satisfies the delay property. Throughout this section, we write $G_{\mathcal{I}}$ and
$G_{\mathcal{O}}$ to denote the graphs over $2^{\mathcal{I}}$ and $2^{\mathcal{O}}$, respectively, as in Sect. 5.1. We start
with the following lemma that states that the system can still win in spite of a
single step delay.


Lemma 1. Let $i_0, i_1 \in 2^{\mathcal{I}}$ such that $(i_0, i_1) \models \rho_{\mathcal{I}}$, and assume that the system
can win from $(i_0, o_0)$. Then the system can also win from $(i_1, o_0)$.


Proof. Let $f$ be a winning strategy for the system from $(i_0, o_0)$. We construct a
winning strategy $f_d$ from $(i_1, o_0)$. Intuitively, $f_d$ acts from state $(i_1, o_0)$ as if it
were following $f$ from state $(i_0, o_0)$, with a delay of a single step: the input in
the current step is used to choose the output in the next step.

We use $f$ to define $f_d$ inductively over play prefixes of length $m \ge 1$, by setting
$f_d((i_1, o_0), \ldots, (i_m, o_{m-1}), i_{m+1}) = f((i_0, o_0), \ldots, (i_{m-1}, o_{m-1}), i_m)$. Note that
$f_d$ is well defined since $GS$ separates variables: from state $(i, o)$, the outputs
that can be chosen for the successor state depend only on $o$, and not on $i$.
Note that by this definition, for every play $(i_1, o_0), (i_2, o_1), \ldots, (i_{m+1}, o_m), \ldots$
consistent with $f_d$, the play $(i_0, o_0), (i_1, o_1), \ldots, (i_m, o_m), \ldots$ is consistent with
$f$. We remark that we define $f_d$ only for proving the lemma, and it is not part
of our solution.

Next, we show that $f_d$ is winning from $(i_1, o_0)$. Take a play
$(i_1, o_0), (i_2, o_1), \ldots$ consistent with $f_d$. By the construction, $(i_0, o_0), (i_1, o_1), \ldots$
is consistent with $f$. Since this is a play on a weak Büchi game, after some point
it must remain in a single SCC $S$, say from state $(i_j, o_j)$. Since $f$ is a winning
strategy, the SCC $S$ must be accepting. Then $o_j, o_{j+1}, \ldots$ is an infinite path in
the SCC $S|_{\mathcal{O}}$, and $i_j, i_{j+1}, \ldots$ is an infinite path in the SCC $S|_{\mathcal{I}}$. Consequently,
$(i_1, o_0), (i_2, o_1), \ldots$ converges to an SCC $\hat{S}$ in which $\hat{S}|_{\mathcal{I}} = S|_{\mathcal{I}}$ and $\hat{S}|_{\mathcal{O}} = S|_{\mathcal{O}}$.
Since the conditions for an SCC $D$ to be accepting depend only on the relation
between $D|_{\mathcal{I}}$ and $D|_{\mathcal{O}}$, we have that $\hat{S}$ is accepting since $S$ is accepting as well.
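
The one-step delay used in the proof of Lemma 1 can be illustrated with a small Python wrapper. This is a sketch under our own assumptions: a strategy is modelled as a function from a play prefix (a list of input/output pairs) and a fresh input to an output, and the name `delay` is hypothetical.

```python
def delay(f, i0):
    """Turn a strategy f played from (i0, o0) into the delayed strategy
    played from (i1, o0): the current input is only used one step later."""
    def f_d(prefix, next_input):
        # prefix = [(i1, o0), ..., (im, o_{m-1})]; next_input is i_{m+1}.
        inputs = [i0] + [i for (i, _) in prefix]   # i0, i1, ..., im
        outputs = [o for (_, o) in prefix]         # o0, ..., o_{m-1}
        shifted = list(zip(inputs[:-1], outputs))  # (i0,o0), ..., (i_{m-1},o_{m-1})
        return f(shifted, inputs[-1])              # f sees im, not i_{m+1}
    return f_d


# Hypothetical strategy that copies the latest input to the output.
echo = lambda history, i: i
f_d = delay(echo, "a")
print(f_d([("b", "x")], "c"))  # prints "b": the previous input, not "c"
```

The wrapper never inspects `next_input`, which is legitimate only because the game structure separates variables, so the legal outputs do not depend on the current input.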


We can now prove the delay property, following by straightforward induction 
from Lemma 1. 


Theorem 1 (Delay Property Theorem). Let $i_0, \ldots, i_n \in (2^{\mathcal{I}})^+$ be a path
in $G_{\mathcal{I}}$, and for $m \ge 0$, let $o_{-m}, \ldots, o_0 \in (2^{\mathcal{O}})^*$ be a path in $G_{\mathcal{O}}$. Assume that
the system can win from $(i_0, o_0)$. Then the system can also win from $(i_n, o_{-m})$.


Proof. From $(i_n, o_{-m})$, the system can simply ignore the inputs and follow the
path in $G_{\mathcal{O}}$ to $o_0$. Let $(i_{n+m}, o_0)$ be the state at that point in some play. Note
that there is a path between $i_n$ and $i_{n+m}$, and therefore there is a path between
$i_0$ and $i_{n+m}$. If the system can win from $(i_0, o_0)$ then by using Lemma 1 in
the induction steps, the system can win by induction from $(i, o_0)$ for all $i$ such
that there is a path in $G_{\mathcal{I}}$ between $i_0$ and $i$. Therefore, the system can win from
$(i_{n+m}, o_0)$, and by consequence from $(i_n, o_{-m})$.


A corollary of Theorem 1 is the following statement about the structure of
the winning region of the weak Büchi game $B = (GS, GF(acc))$ as defined in
Sect. 5.1.


Corollary 1. The winning region of B is a union of SCCs. 


Proof. Let $(i, o)$ be a state in the winning region of $B$, let $(\hat{\imath}, \hat{o})$ be a state in
the same SCC $S$ of $(i, o)$, and let $S|_{\mathcal{I}}$ and $S|_{\mathcal{O}}$ be the projections of $S$ on $G_{\mathcal{I}}$
and $G_{\mathcal{O}}$, respectively. Then there is a path $i_0, \ldots, i_n$ for some $n \ge 0$ in $S|_{\mathcal{I}}$ such
that $i_0 = i$ and $\hat{\imath} = i_n$. Similarly, there is a path $o_{-m}, \ldots, o_0$ for some $m \ge 0$ in
$S|_{\mathcal{O}}$ such that $o_0 = o$ and $\hat{o} = o_{-m}$. Then by the delay property of Theorem 1,
the vertex $(\hat{\imath}, \hat{o}) = (i_n, o_{-m})$ is also in the winning region of $B$.


We use Theorem 1 and Corollary 1 in the proof of correctness of the overall 
winning strategy f, as described in Sect. 6.2.


6 Algorithms for Solving Separated GR(k) Games


In this section we provide the exact details of our synthesis algorithm for
Separated GR(k) games, as described in Sect. 5.1. Since constructing $f_B$ involves
defining and solving a weak Büchi game, we first describe these in Sect. 6.1.
We remark that our weak Büchi game synthesis algorithm works for all weak
Büchi games, and not just for the special weak Büchi game defined in Sect. 5.1.
Specifically, it works even when the underlying game structure does not separate
variables. Next, in Sect. 6.2, we complete the algorithm construction and
describe the correctness of our overall synthesis algorithm.


6.1 Realizability and Synthesis for Weak Büchi Games


We present a symbolic algorithm to solve synthesis of a weak Büchi game. When
represented in explicit state-representation, weak Büchi games are known to be
solvable in linear time in the size of the game [12,27]. In this section, we adapt
the algorithm from [12,27] to symbolic state-space representation. For the sake of
exposition, we give an overview of the algorithm and then present our symbolic
modification.


Overview. Given a weak Büchi game, recall that each SCC in its game graph
$G$ is either an accepting SCC or a non-accepting SCC. The goal is to find the
winning regions in the weak Büchi game. This can be done by backward induction
on the topological ordering of the SCCs as follows. Let $(S_0, \ldots, S_m)$ be a
topological sort of the SCCs in $G$.


Base Case: Consider all terminal partitions, say $S_i, \ldots, S_m$; that is, every SCC
from which no other SCC is reachable. In this case, plays beginning in a terminal
SCC will never leave it. Therefore, all states of terminal SCCs that are accepting
are in the winning region of the system and all states of terminal SCCs that are
non-accepting are in the winning region of the environment.


Induction Step: Let $\vec{S} = (S_{i+1}, \ldots, S_m)$, and suppose that the set $\bigcup \vec{S}$ has
been classified into winning regions for the system, $W^{s}_{i+1}$, and the environment,
$W^{e}_{i+1}$, respectively. Let $\vec{S}_{new} = (S_j, S_{j+1}, \ldots, S_i)$ be the SCCs from which all
edges leaving the SCC lead to an SCC in $\vec{S}$. Further, let $A$ and $N$ be the unions
of all accepting SCCs and all non-accepting SCCs in $\vec{S}_{new}$, respectively. Then
the basic idea is as follows: The system can win from $s \in N$ if and only if it can
force $F(W^{s}_{i+1})$ from $s$. Analogously, the system can win from $s \in A$ if and only if
it can force $G(A \cup W^{s}_{i+1})$ from $s$. Hence, by solving these reachability and safety
games, we can update $W^{s}_{i+1}$ and $W^{e}_{i+1}$ into $W^{s}_{j}$ and $W^{e}_{j}$ that partition the larger
set $\bigcup (S_j, \ldots, S_m)$ into winning regions for the system and the environment. The
winning strategy can be constructed in a standard way as a side-product of the
reachability and safety games in each step, see for example [40,41].

Symbolic Algorithm for Weak Büchi Games. Given a weak Büchi game
$B = ((\mathcal{I}, \mathcal{O}, \theta_{\mathcal{I}}, \theta_{\mathcal{O}}, \rho_{\mathcal{I}}, \rho_{\mathcal{O}}), GF(acc))$ with BDDs representing $\theta_{\mathcal{I}}, \theta_{\mathcal{O}}, \rho_{\mathcal{I}}, \rho_{\mathcal{O}}$
and $acc$, our goal is to compute a BDD for the winning region and to synthesize
a memoryless winning strategy for the system. The construction follows a
fixed-point computation that adapts the inductive procedure described in the
overview: In the basis of the fixed-point computation, the winning region is
the set of accepting terminal SCCs; in the inductive step, the winning region
includes winning states by examining SCCs that are higher in the topological
ordering on SCCs. In what follows we describe a sequence of BDDs that we
construct towards constructing the overall BDD for the winning region. We use the
notation $X$ to denote the set of variables $\mathcal{I} \cup \mathcal{O}$. For the sake of the current
construction, memoryless strategies are given in the form of BDDs over $X, X'$;
for further details on the BDD constructions see the full version [3].


BDD constructions. We start by constructing a BDD for a predicate that
indicates whether two states in a game structure are present in the same SCC. Let
predicate $Reach(s, t')$ hold if there is a path from state $s$ over $\mathcal{I} \cup \mathcal{O}$ to state $t$
over $\mathcal{I} \cup \mathcal{O}$ in the game structure $GS$. Similarly, a predicate $Reach^{-1}(s, t')$ holds
if and only if $Reach(t, s')$ holds. BDDs for $Reach$ and $Reach^{-1}$ can be computed
in $O(N)$ symbolic operations using the transition relation of the game structure.
Then, a BDD indicating whether two states share the same SCC is constructed in
$O(N)$ symbolic operations by $SCC(X, X') := Reach(X, X') \wedge Reach^{-1}(X, X')$.
Next, we construct a BDD for the union of the terminal SCCs, required by
the basis of induction for the construction of the winning region. Let predicate
$Terminal(s)$ hold if state $s$ over $\mathcal{I} \cup \mathcal{O}$ is present in a terminal SCC. Then
$Terminal(X) := \forall X' : Reach(X, X') \rightarrow SCC(X, X')$. Therefore, given BDDs for
$Reach$ and $SCC$, the construction of $Terminal$ requires $O(1)$ symbolic operations.
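
For intuition, the three predicates can be mirrored with explicit Python sets. This is an explicit-state sketch, not the BDD construction itself; the function names are hypothetical and we assume that Reach is reflexive (a path of length zero counts).

```python
def reach_sets(succ):
    """succ maps each state to its set of successors; the result maps
    each state s to the set of all t with Reach(s, t)."""
    reach = {}
    for s in succ:
        seen, stack = {s}, [s]
        while stack:
            for t in succ[stack.pop()]:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        reach[s] = seen
    return reach


def same_scc(reach, s, t):
    """SCC(s, t) := Reach(s, t) and Reach^{-1}(s, t)."""
    return t in reach[s] and s in reach[t]


def terminal(reach):
    """Terminal(s) := for all t, Reach(s, t) implies SCC(s, t)."""
    return {s for s in reach if all(same_scc(reach, s, t) for t in reach[s])}


# Hypothetical graph: {0, 1} is an SCC that can escape to the sink 2.
reach = reach_sets({0: {1}, 1: {0, 2}, 2: {2}})
print(same_scc(reach, 0, 1), terminal(reach))  # True {2}
```

The explicit version performs one graph search per state; the symbolic version obtains the same relations with BDD operations on sets of states instead.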


Computing the Winning Region. We now describe the fixed-point computation
to construct a BDD for the winning region in a weak Büchi game. Let
$Reachability_{(M,N)}(X)$ denote a BDD generated by solving a reachability game
that takes as input a set of source states $M$ and target states $N$ and outputs
those states in $M$ from which the system can guarantee to move into $N$. Similarly,
let $Safety_{(M,N)}(X)$ denote a BDD generated by solving a safety game that
takes as input a set of source states $M$ and target states $N$ and outputs those
states in $M$ from which the system can guarantee that all plays remain inside the
set $N$. These constructions are standard; details can be found in [20, Chapter 2].

Now, let $Win(s)$ denote that state $s$ over $\mathcal{I} \cup \mathcal{O}$ is in the winning region.
Then, $Win(X)$ is the fixed point of the BDD $Win\_Aux$ defined below, where
the construction essentially follows the high-level algorithm description. The
BDD $Acc(X)$ represents the formula $acc$ encoding the set of accepting states.
In addition, $DC^{i}(X)$ is the union $\bigcup \vec{S}$ of the Downward-Closed set of SCCs,
i.e. the SCCs that have already been classified into winning or not-winning,
and $DC^{i}_{new}(X)$ is the union $\bigcup \vec{S}_{new}$ of the SCCs in $DC^{i}(X)$ that were not in
$DC^{i-1}(X)$. Finally, $N^{i}(X)$ is the subset $N$ of non-accepting states in $DC^{i}_{new}(X)$,
and $A^{i}(X)$ is the subset $A$ of accepting states in $DC^{i}_{new}(X)$. We then define
Win_Aux as follows.

Base Case.
$$\begin{aligned}
Win\_Aux^{0}(X) &:= Terminal(X) \wedge Acc(X) \\
DC^{0}(X) &:= Terminal(X)
\end{aligned}$$

Inductive Step.
$$\begin{aligned}
DC^{i+1}(X) &:= \forall X' : Reach(X, X') \rightarrow (SCC(X, X') \vee DC^{i}(X')) \\
DC^{i+1}_{new}(X) &:= DC^{i+1}(X) \setminus DC^{i}(X) \\
N^{i+1}(X) &:= DC^{i+1}_{new}(X) \wedge \neg Acc(X) \\
A^{i+1}(X) &:= DC^{i+1}_{new}(X) \wedge Acc(X) \\
Win\_Aux^{i+1}(X) &:= Win\_Aux^{i}(X) \vee Reachability_{(N^{i+1}(X),\, Win\_Aux^{i}(X))}(X) \\
&\qquad \vee Safety_{(A^{i+1}(X),\, A^{i+1}(X) \vee Win\_Aux^{i}(X))}(X)
\end{aligned}$$

To explain the construction of $Win$, note that a state $s$ in $DC^{i+1}(X)$ is
winning in one of these cases: (i) $s$ is a winning state in $DC^{i}(X)$. (ii) $s$ is a
non-accepting state in $DC^{i+1}(X)$ from which the system can force the play
into a winning state in $DC^{i}(X)$. This set of states can be obtained from
$Reachability_{(N^{i+1}(X),\, Win\_Aux^{i}(X))}(X)$. (iii) $s$ is an accepting state in $DC^{i+1}(X)$ from
which the system can guarantee that every play that leaves the accepting SCC
moves into a winning state in $DC^{i}(X)$. This set of states can be obtained from
$Safety_{(A^{i+1}(X),\, A^{i+1}(X) \vee Win\_Aux^{i}(X))}(X)$.
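
The fixed point can be mirrored with explicit sets in Python. The sketch below is an explicit-state analogue under our own assumptions, not the BDD-based implementation: `moves[s]` lists the environment's choices at state `s`, and each choice is the set of successor states the system may then pick; all names are hypothetical.

```python
def cpre(moves, X):
    """States from which, whatever the environment picks, the system can step into X."""
    return {s for s in moves
            if all(any(t in X for t in choice) for choice in moves[s])}


def reach_sets(succ):
    """Reflexive-transitive reachability, one graph search per state."""
    reach = {}
    for s in succ:
        seen, stack = {s}, [s]
        while stack:
            for t in succ[stack.pop()]:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        reach[s] = seen
    return reach


def solve_weak_buchi(moves, acc):
    """Return the system's winning region. acc is the set of accepting
    states, assumed uniform on every SCC as in a weak Buchi game."""
    succ = {s: set().union(*moves[s]) for s in moves}
    reach = reach_sets(succ)

    def scc(s, t):
        return t in reach[s] and s in reach[t]

    dc = {s for s in moves if all(scc(s, t) for t in reach[s])}  # Terminal
    win = dc & acc                                               # base case
    while len(dc) < len(moves):
        dc_next = {s for s in moves
                   if all(scc(s, t) or t in dc for t in reach[s])}
        new = dc_next - dc
        n_new, a_new = new - acc, new & acc
        # Reachability: least fixed point, force the play from n_new into win.
        r = set(win)
        while True:
            r_next = r | (n_new & cpre(moves, r))
            if r_next == r:
                break
            r = r_next
        # Safety: greatest fixed point, keep the play inside a_new or win.
        g = a_new | win
        while True:
            g_next = win | (a_new & cpre(moves, g))
            if g_next == g:
                break
            g = g_next
        win, dc = r | g, dc_next
    return win


# Hypothetical five-state game in which every SCC is a single state.
moves = {
    0: [{0}, {2}],     # accepting, but the environment can force state 2
    1: [{2, 3}],       # non-accepting, the system can move to state 3
    2: [{2}],          # terminal, non-accepting
    3: [{3}],          # terminal, accepting
    4: [{4, 2}, {1}],  # accepting, the system stays or is sent to state 1
}
print(sorted(solve_weak_buchi(moves, acc={0, 3, 4})))  # [1, 3, 4]
```

State 0 is accepting yet losing, because the environment can push the play into the non-accepting sink 2; this is case (iii) failing.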

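Finally, to check realizability, construct the BDD $\forall \mathcal{I}\,(InitIn(\mathcal{I}) \rightarrow
\exists \mathcal{O}\,(InitOut(\mathcal{O}) \wedge Win(\mathcal{I} \cup \mathcal{O})))$, where $InitIn(\mathcal{I})$ and $InitOut(\mathcal{O})$ are BDDs
representing $\theta_{\mathcal{I}}$ and $\theta_{\mathcal{O}}$, respectively. This BDD is equal to true iff $B$ is realizable.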
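
Read over explicit sets, the realizability check is a forall-exists test over the initial states. A minimal sketch, with hypothetical names and with states represented as (input, output) pairs:

```python
def realizable(init_inputs, init_outputs, win):
    """True iff for every initial input there is an initial output such
    that the combined state lies in the winning region."""
    return all(any((i, o) in win for o in init_outputs) for i in init_inputs)


# Hypothetical winning region over two inputs and two outputs.
win = {("i0", "o0"), ("i1", "o1")}
print(realizable({"i0", "i1"}, {"o0", "o1"}, win))  # True
print(realizable({"i0", "i1"}, {"o0"}, win))        # False
```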

The fixed-point computation can be extended in a standard way to also
compute a BDD representation $F_B(X, X')$ of the winning strategy $f_B$, such that
$(s, (i', o')) \models F_B(X, X')$ iff $f_B(s, i) = o$, as we elaborate in the full version [3].
We then have the following theorem that follows our construction.


Theorem 2. Realizability and synthesis for weak Büchi games can be done in 
O(N) symbolic steps. 


Proof Outline. The proposed construction symbolically implements the induc- 
tive procedure of the explicit algorithm. Hence, it correctly identifies the system’s 
winning region. It remains to show that the algorithm performs O(N) symbolic 
operations. First of all, the constructions of SCC and Terminal take O(N) symbolic
operations collectively. It suffices to show that in the i-th induction step,
solving the reachability and safety games performs $O(|DC^{i+1} \setminus DC^{i}|)$ operations.
This can be proven by a careful analysis of the operations and the sizes of result- 
ing BDDs using standard results on safety and reachability games. 


6.2 Realizability and Synthesis for Separated GR(k) Games 


We finally make use of the elements obtained so far towards solving synthesis for 
Separated GR(k) games. Our construction follows the overview from Sect. 5.1. 
To recall, we describe and construct two auxiliary strategies $f_B$ and $f_{travel}$ and
combine them to generate the final strategy f. We use the delay property theorem 
from Sect. 5.2 to prove the correctness of our algorithm.

We are given a Separated GR(k) game structure $GS = (\mathcal{I}, \mathcal{O}, \theta_{\mathcal{I}}, \theta_{\mathcal{O}}, \rho_{\mathcal{I}}, \rho_{\mathcal{O}})$
and a winning condition $\varphi = \bigwedge_{l=1}^{k} \varphi_l$, where $\varphi_l = \bigwedge_{j=1}^{n_l} GF(a_{l,j}) \rightarrow
\bigwedge_{j=1}^{m_l} GF(g_{l,j})$. We first represent $GS$ and $\varphi$ as BDDs by standard means. We
then define and construct the following.


Constructing $f_B$. Auxiliary strategy $f_B$ is the winning strategy of the system
player in a weak Büchi game constructed from the separated GR(k) game. To
construct a weak Büchi game, we first construct, in $O(|\varphi| + N)$ symbolic operations,
a BDD $Acc(\mathcal{I} \cup \mathcal{O})$ that describes the set of accepting states. The construction
is standard. Next, let $acc$ be the assertion represented by $Acc$ (the assertion
defined in Sect. 5.1). Then the weak Büchi game is $B = (GS, GF(acc))$. Finally,
we construct $f_B$ as the winning strategy of $B$, following Sect. 6.1.


Constructing $f_{travel}$. For the construction of $f_{travel}$, we arbitrarily order all
guarantees that appear in our GR(k) formula: $gar_0, \ldots, gar_{m-1}$. For each guarantee
$gar_j$ we construct a reachability strategy $f_{r(j)}$ that, when applied inside an
SCC $S_{\mathcal{O}}$ in the output game graph $G_{\mathcal{O}}$, moves towards a state that satisfies $gar_j$
without ever leaving $S_{\mathcal{O}}$. In case no such state exists in $S_{\mathcal{O}}$, $f_{r(j)}$ returns a
distinguished value $\bot$. Note that this strategy can entirely ignore the inputs. We equip
$f_{travel}$ with a memory variable $mem$ that stores values from $\{0, \ldots, m-1\}$. Then
$f_{travel}(s, i)$ operates as follows: for $mem, mem+1, \ldots$ we find the first $mem+j$
(mod $m$) such that the SCC of $s$ includes a $gar_{mem+j}$-state, and activate $f_{r(mem+j)}$
to reach such a state. If no guarantees can be satisfied in $S$, we just return an
arbitrary output to stay in $S_{\mathcal{O}}$. The construction of $f_{travel}$ requires $O(|\varphi| N)$ symbolic
BDD-operations as we need to construct $m$ reachability strategies (clearly,
$m \le |\varphi|$).
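
The round-robin behaviour of this strategy can be sketched with an explicit output graph in Python. Several details are our own assumptions rather than the construction above: the reachability strategies are replaced by a breadth-first search, `mem` advances once the current target guarantee has been visited, states are integers, and the SCC is assumed to be non-trivial and to contain the current state.

```python
from collections import deque


def step_towards(succ, scc, goal, o):
    """First move of a shortest path from o, inside scc, to a goal state.
    Returns None (playing the role of the distinguished value) if scc
    contains no goal state."""
    if not any(goal(t) for t in scc):
        return None
    first = {t: t for t in sorted(succ[o]) if t in scc}  # state -> first move
    queue = deque(first)
    while queue:
        u = queue.popleft()
        if goal(u):
            return first[u]
        for v in sorted(succ[u]):
            if v in scc and v not in first:
                first[v] = first[u]
                queue.append(v)
    return None


def f_travel(succ, scc, guarantees, o, mem):
    """Return (next output state, updated mem). Inputs are ignored."""
    m = len(guarantees)
    if guarantees[mem](o):          # assumption: advance after a visit
        mem = (mem + 1) % m
    for j in range(m):
        k = (mem + j) % m
        nxt = step_towards(succ, scc, guarantees[k], o)
        if nxt is not None:
            return nxt, k
    return min(t for t in succ[o] if t in scc), mem  # arbitrary move inside scc


# Hypothetical output graph: a single SCC forming the cycle 0 -> 1 -> 2 -> 0.
succ = {0: {1}, 1: {2}, 2: {0}}
guarantees = [lambda o: o == 1, lambda o: o == 2]
o, mem, visited = 0, 0, []
for _ in range(6):
    o, mem = f_travel(succ, {0, 1, 2}, guarantees, o, mem)
    visited.append(o)
print(visited)  # [1, 2, 0, 1, 2, 0]
```

Cycling through the guarantees in a fixed order is what ensures that every guarantee that is satisfiable in the SCC is visited infinitely often.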


Constructing the overall strategy $f$. Finally, we interleave the strategies $f_B$ and
$f_{travel}$ into a single strategy $f$ as follows: given a state $s$ and an input $i$, if
$s \models Acc(X)$ (that is, if $s$ is an accepting state), then set $f(s, i) = f_{travel}(s, i)$;
otherwise set $f(s, i) = f_B(s, i)$. Whenever $f$ switches from $f_B$ to $f_{travel}$, the
memory variable $mem$ is reset to 0. The next lemma proves that if $f_B$ is winning
then so is $f$.
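
The interleaving can be written as a small stateful wrapper. The following Python sketch is illustrative only; `make_f`, the argument order, and the convention that the travel strategy returns its updated memory are our assumptions.

```python
def make_f(f_b, f_travel, accepting):
    """Combine the two auxiliary strategies into one strategy f.

    f_b(state, inp) -> output
    f_travel(state, inp, mem) -> (output, new_mem)
    accepting(state) -> bool, i.e. whether state satisfies Acc
    """
    mem = 0
    travelling = False

    def f(state, inp):
        nonlocal mem, travelling
        if accepting(state):
            if not travelling:
                mem = 0          # reset on every switch into the travel strategy
            travelling = True
            out, mem = f_travel(state, inp, mem)
            return out
        travelling = False
        return f_b(state, inp)

    return f


# Hypothetical stub strategies that only reveal which branch was taken.
f = make_f(lambda s, i: "b",
           lambda s, i, mem: ("t%d" % mem, mem + 1),
           lambda s: s == "acc")
print([f(s, None) for s in ["x", "acc", "acc", "x", "acc"]])
# ['b', 't0', 't1', 'b', 't0']
```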


Lemma 2. If $f_B$ is a winning strategy for the weak Büchi game $B =
(GS, GF(acc))$, then $f$ is a winning strategy for the Separated GR(k) game
$(GS, \varphi)$.


Proof. Since $f_B$ is a winning strategy, for every initial input $i \models \theta_{\mathcal{I}}$ there
is an initial output $o \models \theta_{\mathcal{O}}$ such that $(i, o)$ is in the winning region of $GS$.
We show that playing $f$ always keeps the play in the winning region of $GS$, and
therefore the play eventually converges to an accepting SCC. Once this happens,
following $f_{travel}$ guarantees that $\varphi$ is satisfied. We know that as long as the play
is in the winning region of $B$, following $f_B$ will keep it inside the winning region.
Therefore, when we switch from $f_B$ to $f_{travel}$ we must be inside the winning
region and, by definition of $f$, in some accepting SCC $S$. Then $f_{travel}$ makes sure
that as long as the environment remains in $S|_{\mathcal{I}}$, the projection of $S$ over the
inputs, the system remains in $S|_{\mathcal{O}}$, the projection of $S$ over the outputs. Thus all
in all the play remains in the winning region of S.

Therefore, the only way that the play can leave the winning region is if,
when the system is in a state $(i_0, o_0)$ and chooses some output $o_{-m}$ according to
$f_{travel}$, the environment chooses input $i_n$ such that the play leaves $S$ and moves
to a state $(i_n, o_{-m})$ in a different SCC of $G$. Note, however, that in this case
there is a path from $i_0$ to $i_n$ and a path from $o_{-m}$ to $o_0$ (since by construction
$f_{travel}$ remains in the same SCC in $G_{\mathcal{O}}$). Since $(i_0, o_0)$ is in the winning region,
by Theorem 1 we have that $(i_n, o_{-m})$ is in the winning region as well.


Final Results. Given Lemma 2, we can obtain our final results on synthesis
and realizability of Separated GR(k) games, as follows. Given a Separated GR(k)
game $(GS, \varphi)$, construct $acc$ and solve the weak Büchi game $(GS, GF(acc))$. Then
construct $f_B$, $f_{travel}$ and $f$ as described above. If $(GS, GF(acc))$ is realizable, then
$f_B$ is a winning strategy and from Lemma 2 we have that $f$ is a winning strategy
for $(GS, \varphi)$. If $(GS, GF(acc))$ is unrealizable, then the environment can force every
play to converge to a non-accepting SCC. Since the GR(k) winning condition cannot be
satisfied from a non-accepting SCC, $(GS, \varphi)$ is also not realizable. Thus we have
the following theorem, see [3] for full details.


Theorem 3. Realizability for separated GR(k) games can be reduced to
realizability of weak Büchi games.


The final result on solving Separated GR(k) games is then as follows, see [3] 
for full details. 


Theorem 4. Let $(GS, \varphi)$ be a separated GR(k) game over the input/output
variables $\mathcal{I}$ and $\mathcal{O}$, respectively. Then, the realizability and synthesis problems for
$(GS, \varphi)$ are solved in $O(|\varphi| + N)$ and $O(|\varphi| N)$ symbolic operations, respectively,
where $N = |2^{\mathcal{I} \cup \mathcal{O}}|$.


Proof Outline. Realizability and synthesis follow Lemma 2 and Theorem 3. It is
left to analyze the number of symbolic operations for constructing $f_B$ and then
$f$. In symbolic operations, constructing $acc$ takes $O(|\varphi| + N)$, and computing
the winning region $W$ for $(GS, GF(acc))$ takes $O(N)$. Checking realizability can
be done by checking if for every initial input $i$ there is an initial output $o$ such
that $(i, o) \in W$, which takes $O(1)$. The winning strategy $f_B$ can be computed
in the process of computing $W$, taking the same number of operations (see [3]
for details). Finally, constructing $f_{travel}$ takes $O((\#gars) N) \le O(|\varphi| N)$, where
$gars$ are all guarantees $GF(g_{l,j})$ that appear in $\varphi$. Therefore, constructing $f$ takes
$O(|\varphi| N)$ symbolic operations in total.

Note that this result is an improvement over the complexity of synthesizing
GR(k) games in general [35].


7 Implementation and Evaluation


We have implemented our Separated GR(k) framework for realizability and syn- 
thesis in a prototype tool SGR(k). The tool implements our symbolic algorithm 
using the CUDD [39] package for BDD manipulation. Our tool is evaluated on a 
suite of benchmarks created from the examples described in Sect. 4. 


Benchmark Suite. We have created a suite of parametric benchmarks from 
the three examples described in Sect.4. Our suite consists of 38 realizable spec- 
ifications. The parametric versions of the examples are described here. 

The multi-mode hardware example is a generalization of the example presented
at the beginning of Sect. 4. It is parameterized by the number of bits $n$
and has $2^n$ modes. The Target can move from mode 0 to any mode and stay
there, while the Adaptee can only move from mode 0 to odd-numbered modes,
and up and down between modes $2i$ and $2i + 1$. The specification consists of $2n$
variables. We generate 10 such benchmarks with $n \in \{1, \ldots, 10\}$.

The cleaning robots example is parameterized in the number of rooms. For
a scenario with $n$ rooms, the specification is written over $4n + 1$ variables. We
create 10 such benchmarks with $n \in \{1, \ldots, 10\}$.

The railway signalling example consists of two parameters: a junction of
$n$ railways and the frequency parameter $m$. With parameters $n$ and $m$, the
specification consists of $(2 + 2\lceil \log m \rceil) n$ variables. We generate 18 benchmarks
with $n \in \{2, \ldots, 10\}$ and $m \in \{2, 3\}$.
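
As a quick sanity check on the size of the suite, the variable counts can be computed directly. The sketch below assumes that the logarithm is taken base 2 and rounded up, which is consistent with the counts 40, 54 and 60 quoted for the largest railway benchmarks; the function names are hypothetical.

```python
def multimode_vars(n):
    """Variables in the multi-mode hardware specification with n bits."""
    return 2 * n


def cleaning_vars(n):
    """Variables in the cleaning robots specification with n rooms."""
    return 4 * n + 1


def railway_vars(n, m):
    """Variables in the railway specification; (m - 1).bit_length()
    equals ceil(log2(m)) for every integer m >= 1."""
    return (2 + 2 * (m - 1).bit_length()) * n


print(railway_vars(10, 2), railway_vars(9, 3), railway_vars(10, 3))  # 40 54 60
print(10 + 10 + 9 * 2)  # 38 benchmarks in total
```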


Experimental Setup and Methodology. We evaluate our tool against 
Strix [1,31], the current state-of-the-art tool for LTL synthesis and SYNTCOMP 
2020 winner of 3 out of 4 tracks [2]. In order to run our benchmarks on Strix, we 
transform the benchmarks (a game structure and a winning condition) into an 
LTL formula that characterizes the same winning plays using the strict semantics 
from [22]. To the best of our knowledge, there is no other synthesis/realizability
tool that operates on GR(k) specifications.

We compare the running time of realizability checks of each benchmark on
both tools. Every benchmark is tested 10 times on both tools to account for the
randomness introduced during BDD construction by the automatic variable
ordering of CUDD. For each benchmark we evaluate (a) the number of executions
on which the tools terminate and (b) the mean running time over 10 executions.

All experiments were executed on a single node of a high-performance com- 
puter cluster consisting of an Intel Xeon processor running at 2.6 GHz with 
32 GB of memory with a timeout of 10 mins. 


Observations and Inferences. Our experiments clearly demonstrate the scal- 
ability and efficiency of our tool in solving Separated GR(k) formulas. 

Figure 4 plots the mean running time for the three benchmarks. We further
report the mean values in Table 1. The table rows refer to the benchmarks we
examine, and the columns refer to the value of the parameter n. As an example,
for the specification Cleaning(3), SGR(k)’s mean running time is 0.07 s (row
titled Cleaning(n), SGR(k); column titled 3) and Strix’s mean realizability check
running time is 58.3 s (row titled Cleaning(n), Strix; column titled 3). Cells
reading ‘TO’ indicate that the experiments reached a timeout.

Fig. 4. Mean running time for different classes of benchmarks.


Table 1. Mean realizability check running times (sec.)

Benchmark      | Tool   | n=1  | n=2  | n=3  | n=4  | n=5  | n=6  | n=7  | n=8  | n=9  | n=10
MultiMode(n)   | SGR(k) | 0.06 | 0.05 | 0.05 | 0.06 | 0.06 | 0.08 | 0.1  | 0.19 | 0.46 | 1.07
MultiMode(n)   | Strix  | 0.13 | 0.29 | TO   | TO   | TO   | TO   | TO   | TO   | TO   | TO
Cleaning(n)    | SGR(k) | 0.05 | 0.05 | 0.07 | 0.09 | 0.16 | 0.26 | 0.63 | 1.16 | 1.78 | 2.43
Cleaning(n)    | Strix  | 0.31 | 0.75 | 58.3 | TO   | TO   | TO   | TO   | TO   | TO   | TO
Railways(n, 2) | SGR(k) | -    | 0.11 | 0.17 | 0.71 | 3.88 | 11.8 | 15.1 | 40.8 | 219  | TO
Railways(n, 2) | Strix  | -    | 382  | TO   | TO   | TO   | TO   | TO   | TO   | TO   | TO
Railways(n, 3) | SGR(k) | -    | 0.07 | 0.36 | 1.67 | 8.39 | 29.8 | 50.3 | 102  | TO   | TO
Railways(n, 3) | Strix  | -    | 381  | TO   | TO   | TO   | TO   | TO   | TO   | TO   | TO

The results show that our tool solves a significantly larger number of
benchmarks than Strix. On the few benchmarks which Strix solves, our tool outperforms
it by several orders of magnitude. Although the running time may vary depending
on the automatic variable ordering chosen by CUDD, we do not believe it
would vary enough to significantly change the results. Specifically, we calculated
the 99% confidence interval for our results, and validated that for all data points
our tool’s entire interval lies below the entire interval for Strix.

Only three benchmarks were unsolvable by our tool (in the sense that the 
majority of the 10 executions timed out). The three benchmarks are the railway 
signal examples with (n = 10,m = 2), (n = 9,m = 3), and (n = 10,m = 3). 
These benchmarks consist of a large number of variables (54, 40, and 60, respec- 
tively), making them particularly challenging. All executions of the remaining 
benchmarks were solved in less than 4 minutes by our tool. 

We also examined the number of solved executions per benchmark. Our tool 
solved all 10 executions for 35 out of 38 benchmarks. These are the 35 bench- 
marks that appear as solved in Fig. 4. For the railway signalling benchmark with 
(n = 10,m = 2), our tool solved 2 out of 10 executions. In contrast, Strix was not 
able to solve even one execution for 31 out of 38 benchmarks. Even increasing 
the timeout to 8 hours only allowed Strix to solve a single additional benchmark. In 
total, Strix and our tool verified realizability of 7 benchmarks and 36 out of 38 
benchmarks, respectively. In summary, our experiments demonstrate that our 
tool is able to solve specifications which are challenging for existing tools. 


8 Related Work 


The Adapter design pattern was introduced in [18], and has been used in many
software contexts since. Our interpretation of the pattern is inspired by the
automata-based description of the pattern proposed by Pedrazzini [34]. We
reformulated the problem as synthesis of reactive controllers that compose with
existing systems to achieve a temporal specification, e.g. [7, 13, 17]. Note that
our work differs from such frameworks in its variable separation feature. A work
with a concept similar to adapting behaviors is shield synthesis, which studies
the problem in which a synthesized controller corrects safety violations of an
existing controller [24]. In contrast, our problem is mostly concerned with
liveness adaptation.

Reactive synthesis of LTL winning conditions is 2EXPTIME-complete in the
size of the formula [37], making it difficult to scale for applications. One
approach to overcoming this computational barrier has been to investigate
fragments and variants of LTL with lower synthesis complexity [4, 14, 16]. One
such fragment is GR(k) [9], which offers a balance between efficiency and
expressiveness. Specifically, GR(k) games are known to be efficient as they are
solved in time exponential in the number of conjunctions k rather than
exponential in the state space [35]. Several studies have also shown that GR(k)
specifications are highly expressive. As evidence, all properties expressed by
deterministic Büchi automata (DBA) can be expressed in GR(k) [16], and a study
of commonly appearing LTL patterns has shown that 52 of 55 patterns are DBA
properties [15, 29]. DBA properties have also been identified as common patterns
in robotics applications [30].
Finally, Separated GR(k) games exhibit the delay property, which intuitively 
means that the system can win even after delaying its action for a finite amount of 
time while ignoring the environment before “catching up” with the environment. 
While this is reminiscent of asynchrony in reactive systems [6,38], a further 
exploration of relations between asynchrony and the delay property is required. 


9 Conclusion 


This paper presents a reactive systems-based model of the adapter design
pattern. We model adapters as transducers and reduce the problem of finding
an Adapter transducer for given Adaptee and Target systems to the problem
of synthesizing strategies for Separated GR(k) games. Through an analysis of
theoretical complexity and algorithmic performance, we show that realizability
and synthesis of Separated GR(k) games are efficient and scalable. Furthermore,
by outperforming Strix, an existing state-of-the-art synthesis tool, we show that
algorithms for the Separated GR(k) class of specifications add value to the
portfolio of reactive synthesis tools.

The benefits of separation of input and output variables were previously 
shown in the context of Boolean Functional Synthesis [11]. Through this work, 
we showed that separation also leads to practically viable solutions in temporal 
reactive synthesis, specifically when encoding the types of equivalence relations 
that appear in reactive adaptation (where properties of runs of the first system 
are compared to properties of runs of the other). Since the systems may be loosely 
coupled, i.e., they may not run on the same clock, specifications that impose 
joint temporal constraints on the two systems may not be realizable. Thus, our 
proposition to use the type of equivalence that separated GR(k) formulas allow, 
gives users the power needed for comparing the overall behaviors of the systems 
while allowing realizability and efficient synthesis. 

The results presented in this paper encourage future studies on the separation
of variables in a broader context. For instance, one could reason about variants
of the adapter design pattern that do not separate variables all the way through,
that is to say, variants that translate to more general GR(k) specifications in
which the separation appears in the input and output systems but not in the
specification itself. One could further study the notion of separation of variables
in more general LTL specifications. Another direction is to consider systems
that get two types of input: from the input system (i.e., the Target) as well as
from an environment. We believe that these future directions would enable the
development of tools for synthesis from temporal specifications with a focus on
expressing practical applications as well as ensuring scalability and efficiency.
Abstract. We present a causality-based algorithm for solving two-player
reachability games represented by logical constraints. These games
are a useful formalism to model a wide array of problems arising, e.g., 
in program synthesis. Our technique for solving these games is based on 
the notion of subgoals, which are slices of the game that the reachabil- 
ity player necessarily needs to pass through in order to reach the goal. 
We use Craig interpolation to identify these necessary sets of moves and 
recursively slice the game along these subgoals. Our approach allows us 
to infer winning strategies that are structured along the subgoals. If the 
game is won by the reachability player, this is a strategy that progresses 
through the subgoals towards the final goal; if the game is won by the 
safety player, it is a permissive strategy that completely avoids a sin- 
gle subgoal. We evaluate our prototype implementation on a range of 
different games. On multiple benchmark families, our prototype scales 
dramatically better than previously available tools. 


1 Introduction 


Two-player games are a fundamental model in logic and verification due to their 
connection to a wide range of topics such as decision procedures, synthesis and 
control [1,2,6,7,11,21]. Algorithmic techniques for finite-state two-player games 
have been studied extensively for many acceptance conditions [20]. For infinite- 
state games most problems are directly undecidable. However, infinite state 
spaces occur naturally in domains like software synthesis [34] and cyber-physical 
systems [23], and hence handling such games is of great interest. An elegant clas- 
sification of infinite-state games that can be algorithmically handled, depending 
on the acceptance condition of the game, was given in [14]. The authors assume a 
symbolic encoding of the game in a very general form. More recently, incomplete 
procedures for solving infinite-state two-player games specified using logical con- 
straints were studied [4,18]. While [4] is based on automated theorem-proving 
for Horn formulas and handles a wide class of acceptance conditions, the work 
in [18] focusses on reachability games specified in the theory of linear arithmetic, 
and uses sophisticated decision procedures for that theory. 

In this paper, we present a novel technique for solving logically represented 
reachability games based on the notion of subgoals. A necessary subgoal is a 
transition predicate that is satisfied at least once on every play that reaches 
the overall goal. It represents an intermediate target that the reachability player 
must reach in order to win. Subgoals open up game solving to the study of cause- 
effect relationships in the form of counterfactual reasoning [28]: If a cause (the 
subgoal) had not occurred, then the effect (reaching the goal) would not have 
happened. Thus for the safety player, a necessary subgoal provides a chance to 
win the game based on local information: If they control all states satisfying 
the pre-condition of the subgoal, then any strategy that in these states picks a 
transition outside of the subgoal is winning. Finding such a necessary subgoal 
may let us conclude that the safety player wins without ever having to unroll 
the transition relation. 

On the other hand, passing through a necessary subgoal is in general not 
enough for the reachability player to win. We call a subgoal sufficient if indeed 
the reachability player has a winning strategy from every state satisfying the 
post-condition of the subgoal. Dual to the description in the preceding para- 
graph, sufficient subgoals provide a chance for the reachability player to win 
the global game as they must merely reach this intermediate target. The two 
properties differ in one key aspect: While necessity of a subgoal only considers 
the paths of the game arena, for sufficiency the game structure is crucial. 

We show how Craig interpolants can be used to compute necessary subgoals, 
making our methods applicable to games represented by any logic that supports 
interpolation. In contrast, determining whether a subgoal is sufficient requires a 
partial solution of the given game. This motivates the following recursive app- 
roach. We slice the game along a necessary subgoal into two parts, the pre-game 
and the post-game. In order to guarantee these games to be smaller, we solve 
the post-game under the assumption that the considered subgoal was bridged for 
the last time. We conclude that the safety player wins the overall game if they 
can avoid all initial states of the post-game that are winning for the reachability 
player. Otherwise, the pre-game is solved subject to the winning condition given 
by the sufficient subgoal consisting of these states. This approach does not only 
determine which player wins from each initial state, but also computes sym- 
bolically represented winning strategies with a causal structure. Winning safety 
player strategies induce necessary subgoals that the reachability player cannot 
pass, which constitutes a cause for their loss. Winning reachability player strate- 
gies represent a sequence of sufficient subgoals that will be passed, providing 
an explanation for the win. All missing proofs for our theoretical results can be 
found in the full version of this paper [3]. 
The Python-based implementation CABPy of our approach was used to compare
its performance to SIMSYNTH [18], which is, to the best of our knowledge,
the only other available tool for solving linear arithmetic reachability games.
Our experiments demonstrate that our algorithm is competitive in many case
studies. We can also confirm the expectation that our approach heavily benefits
from qualitatively expressive Craig interpolants. It is noteworthy that, like
SIMSYNTH, our approach is fully automated and does not require any input in the
form of hints or templates. Our contributions are summarized as follows:


— We introduce the concept of necessary and sufficient subgoals and show how 
Craig interpolation can be used to compute necessary subgoals (Sect. 4). 

— We describe an algorithm for solving logically represented two-player reacha- 
bility games using these concepts. We also discuss how to compute represen- 
tations of winning strategies in our approach (Sect. 5). 

— We evaluate our approach experimentally through our Python-based tool 
CABPy, demonstrating a competitive performance compared to the previously
available tool SIMSYNTH on various case studies (Sect. 6).


Related Work. The problem of solving linear arithmetic games is addressed 
in [18] using an approach that relies on a dedicated decision procedure for quan- 
tified linear arithmetic formulas, together with a method to generalize safety 
strategies from truncated versions of the game that end after a prescribed number 
of rounds. Other approaches for solving infinite-state games include deductive 
methods that compute the winning regions of both players using proof rules [4], 
predicate abstraction where an abstract controlled predecessor operation is used 
on the abstract game representation [38], and symbolic BDD-based exploration 
of the state space [15]. Additional techniques are available for finite-state games, 
e.g., generalizing winning runs into a winning strategy for one of the players [31]. 

Our notion of subgoal is related to the concept of landmarks as used in 
planning [22]. Landmarks are milestones that must be true on every successful 
plan, and they can be used to decompose a planning task into smaller sub-tasks. 
Landmarks have also been used in a game setting to prevent the opponent from 
reaching their goal using counter-planning [32]. Whenever a planning task is 
unsolvable, one method to find out why is checking hierarchical abstractions for 
solvability and finding the components causing the problem [36]. 

Causality-based approaches have also been used for model checking of multi- 
threaded concurrent programs [24,25]. In our approach, we use Craig interpo- 
lation to compute the subgoals. Interpolation has already been used in similar 
contexts before, for example to extract winning strategies from game trees [16] 
or to compute new predicates to refine the game abstractions [10]. In [18],
interpolation is used to synthesize concrete winning strategies from so-called
winning strategy skeletons, which describe a set of strategies of which at least
one is winning.
2 Motivating Example 


Consider the scenario that an expensive painting is displayed in a large exhibition 
room of a museum. It is secured with an alarm system that is controlled via a 
control panel on the opposite side of the room. A security guard is sleeping at 
the control panel and occasionally wakes up to check whether the alarm is still 
armed. To steal the painting, a thief first needs to disable the alarm and then 
reach the painting before the alarm has been reactivated. We model this scenario 
as a two-player game between a safety player (the guard) and a reachability 
player (the thief) in the theory of linear arithmetic. The moves of both players, 
their initial positions, and the goal condition are described by the formulas: 


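\begin{align*}
\mathit{Init}  &\equiv \neg r \land x = 0 \land y = 0 \land p = 0 \land a = 1 \land t = 0, \\
\mathit{Guard} &\equiv \neg r \land r' \land x' = x \land y' = y \land p' = p \\
               &\quad \land \big((t' = t - 1 \land a' = a) \lor (t \le 0 \land t' = 2)\big), && \text{(sleep or wake up)} \\
\mathit{Thief} &\equiv r \land \neg r' \land t' = t \\
               &\quad \land x + 1 \ge x' \ge x - 1 \land y + 1 \ge y' \ge y - 1 && \text{(move)} \\
               &\quad \land (x \ne 0 \lor y \ne 10 \implies a' = a) && \text{(alarm off)} \\
               &\quad \land (x \ne 10 \lor y \ne 5 \lor a = 1 \implies p' = p), && \text{(steal)} \\
\mathit{Goal}  &\equiv \neg r \land p = 1.
\end{align*}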


The thief's position in the room is modeled by two coordinates $x, y \in \mathbb{R}$ with
initial value (0, 0), and with every transition the thief can move some bounded
distance. Note that we use primed variables to represent the value of variables
after taking a transition. The control panel is located at (0, 10) and the painting
at (10, 5). The status of the alarm and the painting are described by two boolean
variables $a, p \in \{0, 1\}$. The guard wakes up every two time units, modeled by
the variable $t \in \mathbb{R}$. The variables x, y are bounded to the interval [0, 10] and t to
[0, 2]. The boolean variable r encodes who makes the next move. In the presented
configuration, the thief needs more time to move from the control panel to the
painting than the guard will sleep. It follows that there is a winning strategy for
the guard, namely, to always reactivate the alarm upon waking up.
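
The timing argument can be illustrated with a small simulation. The sketch below makes several simplifying assumptions that are not part of the model above: the thief only makes integer unit moves, the thief follows one fixed greedy policy (go to the panel, switch the alarm off, run to the painting), and the function names are hypothetical. It therefore gives evidence for the guard's strategy against this particular thief, not a proof that the guard wins.

```python
PANEL, PAINTING = (0, 10), (10, 5)


def step_towards(pos, target):
    """Move one coordinate by at most 1 towards its target value."""
    return pos + max(-1, min(1, target - pos))


def guard_move(s, rearm=True):
    """Guard transition: sleep (t decreases) or wake up (t is reset to 2)."""
    n = dict(s, r=True)
    if s["t"] <= 0:              # wake up; the alarm value may be chosen freely
        n["t"] = 2
        if rearm:
            n["a"] = 1
    else:                        # sleep; the alarm is left unchanged
        n["t"] = s["t"] - 1
    return n


def thief_move(s):
    """Greedy thief: disable the alarm at the panel, then run to the painting."""
    n = dict(s, r=False)
    if (s["x"], s["y"]) == PANEL:
        n["a"] = 0                                   # (alarm off)
    if (s["x"], s["y"]) == PAINTING and s["a"] == 0:
        n["p"] = 1                                   # (steal)
    target = PAINTING if n["a"] == 0 else PANEL
    n["x"] = step_towards(s["x"], target[0])         # (move)
    n["y"] = step_towards(s["y"], target[1])
    return n


def thief_steals(rounds, rearm=True):
    """Play `rounds` guard/thief rounds from Init; report whether p becomes 1."""
    s = dict(r=False, x=0, y=0, p=0, a=1, t=0)       # Init
    for _ in range(rounds):
        s = thief_move(guard_move(s, rearm))
        if s["p"] == 1:
            return True
    return False
```

Calling `thief_steals(300)` exercises the guard strategy described above, while `thief_steals(300, rearm=False)` shows what happens when the guard never reactivates the alarm.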

Although it is intuitively fairly easy to come up with this strategy for the 
guard, it is surprisingly hard for game solving tools to find it. The main obstacle 
is the infinite state space of this game. Our approach for solving games repre- 
sented in this logical way imitates causal reasoning: Humans observe that in 
order for the thief to steal the painting (i.e., the effect p = 1), a transition must 
have been taken whose source state does not satisfy the pre-condition of (steal) 
while the target state does. Part of this cause is the condition a = 0, i.e., the 
alarm is off. Recursively, in order for the effect a = 0 to happen, a transition 
setting a from 1 to 0 must have occurred, and so on. 

Our approach captures these cause-effect relationships through the notion of 
necessary subgoals, which are essential milestones that the reachability player has 
to transition through in order to achieve their goal. The first necessary subgoal 
corresponding to the intuitive description above is 
\[ C_1 \equiv (\mathit{Guard} \lor \mathit{Thief}) \land p \ne 1 \land p' = 1. \]


In this case, it is easy to see that $C_1$ is also a sufficient subgoal, meaning that all
successor states of $C_1$ are winning for the thief. Therefore, it is enough to solve
the game with the modified objective to reach those predecessor states of $C_1$
from which the thief can enforce $C_1$ being the next move (even if it is not their
turn). Doing so recursively produces the necessary subgoal


\[ C_2 \equiv (\mathit{Guard} \lor \mathit{Thief}) \land a \ne 0 \land a' = 0, \]


meaning that some transition must have caused the effect that the alarm is
disabled. However, $C_2$ is not sufficient, which can be seen by recursively solving
the game spanning from successor states of $C_2$ to $C_1$. This computation has an
important caveat: After passing through $C_2$, it may happen that a is reset to
1 at a later point (in this particular case, this constitutes precisely the winning
strategy of the safety player), which means that there is no canonical way to
slice the game along this subgoal into smaller parts. Hence the recursive call
solves the game from $C_2$ to $C_1$ subject to the bold assumption that any move
from $a = 0$ to $a' = 1$ is winning for the guard. This generally underapproximates
the winning states of the thief. Remarkably, we show that this approximation is
enough to build winning strategies for both players from their respective winning
regions. In this case, it allows us to infer that moving through $C_2$ is always a
losing move for the thief. However, at the same time, any play reaching Goal
has to move through $C_2$. It follows that the thief loses the global game.

We evaluated our method on several configurations of this game, which we 
call Mona Lisa. The results in Sect. 6 support our conjecture that the room size 
has little influence on the time our technique needs to solve the game. 


3 Preliminaries 


We consider two-player reachability games defined by formulas in a given logic
$\mathcal{L}$. We let $\mathcal{L}(V)$ be the $\mathcal{L}$-formulas over a finite set of variables $V$, also called state
predicates in the following. We call $V' = \{v' \mid v \in V\}$ the set of primed variables,
which are used to represent the value of variables after taking a transition.
Transitions are expressed by formulas in the set $\mathcal{L}(V \cup V')$, called transition
predicates. For some formula $\varphi \in \mathcal{L}(V)$, we denote the substitution of all variables
by their primed variant by $\varphi[V/V']$. Similarly, we define $\varphi[V'/V]$.

For our algorithm we will require the satisfiability problem of $\mathcal{L}$-formulas to
be decidable and Craig interpolants [13] to exist for any two mutually
unsatisfiable formulas. Formally, we assume there is a function $\mathit{Sat} : \mathcal{L}(V) \to \mathbb{B}$
that checks the satisfiability of some formula $\varphi \in \mathcal{L}(V)$ and an unsatisfiability
check $\mathit{Unsat} : \mathcal{L}(V) \to \mathbb{B}$. For interpolation, we assume that there is a function
$\mathit{Interpolate} : \mathcal{L}(V) \times \mathcal{L}(V) \to \mathcal{L}(V)$ computing a Craig interpolant for mutually
unsatisfiable formulas: If $\varphi, \psi \in \mathcal{L}(V)$ are such that $\mathit{Unsat}(\varphi \land \psi)$ holds, then
$\psi \implies \mathit{Interpolate}(\varphi, \psi)$ is valid, $\mathit{Interpolate}(\varphi, \psi) \land \varphi$ is unsatisfiable, and
$\mathit{Interpolate}(\varphi, \psi)$ only contains variables shared by $\varphi$ and $\psi$.
These functions are provided by many modern Satisfiability Modulo Theories 
(SMT) solvers, in particular for the theories of linear integer arithmetic and linear 
real arithmetic, which we will use for all our examples. Note that interpolation is 
usually only supported for the quantifier-free fragments of these logics, while our 
algorithm will introduce existential quantifiers. Therefore, we resort to quantifier 
elimination wherever necessary, for which there are known procedures for both 
linear integer arithmetic and linear real arithmetic formulas [29,33]. 
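
The three conditions required of an interpolant can be checked mechanically. The following sketch does not call an SMT solver; as a stand-in, it enumerates a small integer range, so it is only a bounded sanity check. The formulas `phi` and `psi`, the range `DOMAIN`, and the function names are illustrative assumptions.

```python
from itertools import product

DOMAIN = range(-2, 9)  # small integer range standing in for the real theory


def valid(pred, variables):
    """Check that pred holds for every assignment of `variables` over DOMAIN."""
    return all(pred(dict(zip(variables, vals)))
               for vals in product(DOMAIN, repeat=len(variables)))


def is_interpolant(phi, phi_vars, psi, psi_vars, itp, itp_vars):
    """Bounded check of the three interpolant conditions for the pair (phi, psi)."""
    shared = set(phi_vars) & set(psi_vars)
    all_vars = sorted(set(phi_vars) | set(psi_vars))
    return (
        set(itp_vars) <= shared                                   # shared variables only
        and valid(lambda s: not psi(s) or itp(s), all_vars)       # psi implies itp
        and valid(lambda s: not (itp(s) and phi(s)), all_vars)    # itp and phi is unsat
    )


# Two mutually unsatisfiable formulas: phi over {x, y}, psi over {x, z}.
phi = lambda s: s["x"] <= 2 and s["y"] == s["x"]
psi = lambda s: s["x"] >= 5 and s["z"] == s["x"]
```

For instance, `x >= 4` passes the check for this pair, whereas `x >= 1` overlaps with `phi` and `x >= 6` is not implied by `psi`.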

In order to distinguish the two players, we will assume that a Boolean variable
called $r \in V$ exists, which holds exactly in the states controlled by the
reachability player. For all other variables $v \in V$, we let $\mathcal{D}(v)$ be the domain
of $v$, and we define $\mathcal{D} = \bigcup \{\mathcal{D}(v) \mid v \in V\}$. In the remainder of the paper, we
consider the variables $V$ and their domains to be fixed.


Definition 1 (Reachability Game). A reachability game is defined by a tuple
$G = (\mathit{Init}, \mathit{Safe}, \mathit{Reach}, \mathit{Goal})$, where $\mathit{Init} \in \mathcal{L}(V)$ is the initial condition,
$\mathit{Safe} \in \mathcal{L}(V \cup V')$ defines the transitions of player SAFE, $\mathit{Reach} \in \mathcal{L}(V \cup V')$ defines the
transitions of player REACH and $\mathit{Goal} \in \mathcal{L}(V)$ is the goal condition.

We require the formulas $(\mathit{Safe} \implies \neg r)$ and $(\mathit{Reach} \implies r)$ to be valid.


A state $s$ of $G$ is a valuation of the variables $V$, i.e., a function $s : V \to \mathcal{D}$
that satisfies $s(v) \in \mathcal{D}(v)$ for all $v \in V$. We denote the set of states by $S$, and we
let $S_{\mathrm{SAFE}}$ be the states $s$ such that $s(r) = \mathit{false}$, and $S_{\mathrm{REACH}}$ be the states $s$ such
that $s(r) = \mathit{true}$. The variable $r$ determines whether REACH or SAFE makes the
move out of the current state, and in particular $\mathit{Safe} \land \mathit{Reach}$ is unsatisfiable.

Given a state predicate $\varphi \in \mathcal{L}(V)$, we denote by $\varphi(s)$ the closed formula we
get by replacing each occurrence of variable $v \in V$ in $\varphi$ by $s(v)$. Similarly, given
a transition predicate $\tau \in \mathcal{L}(V \cup V')$ and states $s, s'$, we let $\tau(s, s')$ be the formula
we obtain by replacing all occurrences of $v \in V$ in $\tau$ by $s(v)$, and all occurrences
of $v' \in V'$ in $\tau$ by $s'(v)$. For replacing only $v \in V$ by $s(v)$, we define $\tau(s) \in \mathcal{L}(V')$.
A trap state of $G$ is a state $s$ such that $(\mathit{Safe} \lor \mathit{Reach})(s) \in \mathcal{L}(V')$ is unsatisfiable
(i.e., $s$ has no outgoing transitions).

A play of $G$ starting in state $s_0$ is a finite or infinite sequence of states
$\rho = s_0 s_1 s_2 \ldots \in S^+ \cup S^\omega$ such that for all $i < \mathit{len}(\rho)$ either $\mathit{Safe}(s_i, s_{i+1})$ or
$\mathit{Reach}(s_i, s_{i+1})$ is valid, and if $\rho$ is a finite play, then $s_{\mathit{len}(\rho)}$ is required to be
a trap state. Here, $\mathit{len}(s_0 \ldots s_n) = n$ for finite plays, and $\mathit{len}(\rho) = \infty$ if $\rho$ is
an infinite play. The set of plays of some game $G = (\mathit{Init}, \mathit{Safe}, \mathit{Reach}, \mathit{Goal})$
is defined as $\mathit{Plays}(G) = \{\rho = s_0 s_1 s_2 \ldots \mid \rho \text{ is a play in } G \text{ s.t. } \mathit{Init}(s_0) \text{ holds}\}$.
REACH wins some play $\rho = s_0 s_1 \ldots$ if the play reaches a goal state, i.e., if there
exists some integer $0 \le k \le \mathit{len}(\rho)$ such that $\mathit{Goal}(s_k)$ is valid. Otherwise, SAFE
wins play $\rho$. A reachability strategy $\sigma_R$ is a function $\sigma_R : S^* S_{\mathrm{REACH}} \to S$ such that
if $\sigma_R(\omega s) = s'$ and $s$ is not a trap state, then $\mathit{Reach}(s, s')$ is valid. We say that
a play $\rho = s_0 s_1 s_2 \ldots$ is consistent with $\sigma_R$ if for all $i$ such that $s_i(r) = \mathit{true}$
we have $s_{i+1} = \sigma_R(s_0 \ldots s_i)$. A reachability strategy $\sigma_R$ is winning from some
state $s$ if REACH wins every play consistent with $\sigma_R$ starting in $s$. We define
safety strategies $\sigma_S$ for SAFE analogously. We say that a player wins in or from
a state $s$ if they have a winning strategy from $s$. Lastly, REACH wins the game $G$
if they win from some initial state. Otherwise, SAFE wins.
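
As a small illustration of plays and strategies, the sketch below unrolls the unique play that results when both players follow memoryless strategies on an explicit finite game. The six-state game, the restriction to memoryless strategies, and the name `unroll` are assumptions made for this example; the definitions above also cover infinite state spaces and strategies with memory.

```python
# A hypothetical finite game: states 0..5, state 5 is the only goal state.
SAFE_T = {(1, 3), (1, 0), (2, 4)}            # moves of SAFE (states with r = false)
REACH_T = {(0, 1), (0, 2), (3, 3), (4, 5)}   # moves of REACH (states with r = true)
GOAL = {5}


def unroll(start, sigma_r, sigma_s):
    """Follow memoryless strategies from `start`.

    Returns the states visited (in order of first visit) and whether REACH wins.
    States without a strategy entry are treated as trap states.
    """
    strategy = {**sigma_r, **sigma_s}
    order, seen, s = [], set(), start
    while s not in seen:          # a repeated state means the play cycles forever
        seen.add(s)
        order.append(s)
        if s not in strategy:     # trap state: the play is finite and ends here
            break
        nxt = strategy[s]
        assert (s, nxt) in SAFE_T | REACH_T, "strategies must pick existing moves"
        s = nxt
    return order, any(q in GOAL for q in order)
```

With `sigma_r = {0: 2, 3: 3, 4: 5}` and `sigma_s = {1: 3, 2: 4}`, the play from state 0 ends in the goal state 5, while the play from state 1 loops in state 3 and is won by SAFE.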
We often project a transition predicate $T$ onto the source or target states of
transitions satisfying $T$, which is taken care of by the formulas $\mathit{Pre}(T) = \exists V'.\, T$
and $\mathit{Post}(T) = \exists V.\, T$. The notation $\exists V$ (resp. $\exists V'$) represents the existential
quantification over all variables in the corresponding set. Given $\varphi \in \mathcal{L}(V)$, we
call the set of transitions in $G$ that move from states not satisfying $\varphi$ to states
satisfying $\varphi$ the instantiation of $\varphi$, formally:

\[ \mathit{Instantiate}(\varphi, G) = (\mathit{Safe} \lor \mathit{Reach}) \land \neg\varphi \land \varphi', \]

where $\varphi'$ denotes $\varphi[V/V']$.
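
On an explicit finite game, these operators reduce to simple set operations. The sketch below represents a transition predicate as a set of state pairs and a state predicate as a set of states; this explicit encoding and the example game are assumptions for illustration, whereas the paper works with symbolic formulas.

```python
# A hypothetical finite game: states 0..5, given by its two transition sets.
SAFE_T = {(1, 3), (1, 0), (2, 4)}
REACH_T = {(0, 1), (0, 2), (3, 3), (4, 5)}


def pre(T):
    """Pre(T): source states of the transitions in T."""
    return {s for (s, _) in T}


def post(T):
    """Post(T): target states of the transitions in T."""
    return {t for (_, t) in T}


def instantiate(phi, safe_t, reach_t):
    """Instantiate(phi, G): transitions that lead from outside phi into phi."""
    return {(s, t) for (s, t) in safe_t | reach_t if s not in phi and t in phi}
```

For example, the state predicate `{4, 5}` is instantiated by the single transition `(2, 4)`, whose pre-condition is `{2}` and whose post-condition is `{4}`.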


4 Subgoals 


We formally define the notion of subgoals. Let $G = (\mathit{Init}, \mathit{Safe}, \mathit{Reach}, \mathit{Goal})$ be a
fixed reachability game throughout this section, where we assume that $\mathit{Init} \land \mathit{Goal}$
is unsatisfiable. Whenever this assumption is not satisfied in our algorithm, we
will instead consider the game $G' = (\mathit{Init} \land \neg\mathit{Goal}, \mathit{Safe}, \mathit{Reach}, \mathit{Goal})$, which does
satisfy it. As states in $\mathit{Init} \land \mathit{Goal}$ are immediately winning for REACH, this is not
a real restriction.


Definition 2 (Enforceable transitions). The set of enforceable transitions
relative to a transition predicate $T \in \mathcal{L}(V \cup V')$ is defined by the formula

\[ \mathit{Enf}(T, G) = (\mathit{Safe} \lor \mathit{Reach}) \land T \land \neg\exists V'.\, (\mathit{Safe} \land \neg T). \]


The enforceable transitions operator serves a purpose similar to the controlled 
predecessors operator commonly known in the literature, which is often used in a 
backwards fixed point computation, called attractor construction [37]. For both 
operations, the idea is to determine controllability by REACH. The main difference 
is that we do not consider the whole transition relation, but only a predetermined 
set of transitions and check from which predecessor states the post-condition of 
the set can be enforced by REACH. These include all transitions in T controlled 
by REACH and additionally transitions in T controlled by SAFE such that all other 
transitions in the origin state of the transition also satisfy T. The similarity with 
the controlled predecessor is exemplified by the following lemma: 


Lemma 3. Let T be a transition predicate, and suppose that all states satisfying 
Post(T)[V’/V] are winning for REACH in G. Then all states in Pre(Enf(T,G)) are 
winning for REACH in G. 


Proof. Clearly, all states in Pre(Enf(T,G)) that are under the control of REACH 
are winning for REACH, as in any such state they have a transition satisfying T 
(observe that Enf(T,G) => T is valid), which leads to a winning state by 
assumption. 

So let $s$ be a state satisfying $\mathit{Pre}(\mathit{Enf}(T, G))$ that is under the control of SAFE.
As $\mathit{Pre}(\mathit{Enf}(T, G))(s)$ is valid, $s$ has a transition that satisfies $T$ (in particular,
$s$ is not a trap state). Furthermore, we know that there is no $s' \in S$ such that
$\mathit{Safe}(s, s') \land \neg T(s, s')$ holds, and hence there is no transition satisfying $\neg T$ from
$s$. Since $\mathit{Post}(T)[V'/V]$ is winning for REACH, it follows that from $s$ player SAFE
cannot avoid playing into a winning state of REACH.
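
The enforceable transitions operator and the statement of Lemma 3 can be tried out on an explicit finite game. The sketch below is an assumption-laden illustration: the game, the set-based encoding of predicates, and the helper names are hypothetical, and the winning region is computed by a plain attractor construction rather than by the approach of this paper.

```python
STATES = {0, 1, 2, 3, 4, 5}
REACH_STATES = {0, 3, 4}                      # states with r = true
SAFE_T = {(1, 3), (1, 0), (2, 4)}             # Safe: moves out of states 1 and 2
REACH_T = {(0, 1), (0, 2), (3, 3), (4, 5)}    # Reach: moves out of states 0, 3, 4
GOAL = {5}


def pre(T):
    """Pre(T): source states of the transitions in T."""
    return {s for (s, _) in T}


def post(T):
    """Post(T): target states of the transitions in T."""
    return {t for (_, t) in T}


def enf(T, safe_t, reach_t):
    """Enf(T, G): transitions in T whose source has no SAFE move outside T."""
    can_escape = {s for (s, t) in safe_t if (s, t) not in T}
    return {(s, t) for (s, t) in (safe_t | reach_t) & T if s not in can_escape}


def reach_winning_region(safe_t, reach_t, goal):
    """Backward fixed point (attractor) for REACH on the explicit graph."""
    succ = {s: {t for (q, t) in safe_t | reach_t if q == s} for s in STATES}
    win = set(goal)
    changed = True
    while changed:
        changed = False
        for s in STATES - win:
            if not succ[s]:
                continue                      # trap state outside the goal
            if s in REACH_STATES:
                good = bool(succ[s] & win)    # REACH needs one good move
            else:
                good = succ[s] <= win         # SAFE must have no way out
            if good:
                win.add(s)
                changed = True
    return win


T = {(0, 2), (1, 0), (2, 4)}                  # all targets are winning for REACH
WIN = reach_winning_region(SAFE_T, REACH_T, GOAL)
```

Here state 1 has a SAFE move `(1, 3)` outside `T`, so `(1, 0)` is not enforceable, and indeed state 1 is not winning for REACH in this example.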
We now turn to a formal definition of necessary subgoals, which intuitively 
are sets of transitions that appear on every play that is winning for REACH. 


Definition 4 (Necessary subgoal). A necessary subgoal $C \in \mathcal{L}(V \cup V')$ for $G$
is a transition predicate such that for every play $\rho = s_0 s_1 \ldots$ of $G$ and $n \in \mathbb{N}$
such that $\mathit{Goal}(s_n)$ is valid, there exists some $k < n$ such that $C(s_k, s_{k+1})$ is
valid.


Necessary subgoals provide a means by which winning safety player strategies 
can be identified, as formalized in the following lemma. 


Lemma 5. A safety strategy $\sigma_S$ is winning in $G$ if and only if there exists a
necessary subgoal $C$ for $G$ such that for all plays $\rho = s_0 s_1 \ldots$ of $G$ consistent
with $\sigma_S$ there is no $n \in \mathbb{N}$ such that $C(s_n, s_{n+1})$ holds.


Proof. "$\Rightarrow$". The transition predicate $\mathit{Goal}[V/V']$ (i.e., transitions with
endpoints satisfying Goal) is clearly a necessary subgoal. If $\sigma_S$ is winning for SAFE,
then no play consistent with $\sigma_S$ contains a transition in this necessary subgoal.
"$\Leftarrow$". Let $C$ be a necessary subgoal such that no play consistent with $\sigma_S$
contains a transition of $C$. Then by Definition 4 no play consistent with $\sigma_S$ contains
a state satisfying Goal. Hence $\sigma_S$ is a winning strategy for SAFE.


Of course, the question remains how to compute non-trivial subgoals. Indeed, 
using Goal as outlined in the proof above provides no further benefit over a 
simple backwards exploration (see Remark 15 in the following section). 

Ideally, a subgoal should represent an interesting key decision to focus the
strategy search. As we show next, Craig interpolation allows us to extract partial
causes for the mutual unsatisfiability of Init and Goal and can in this way provide
necessary subgoals. Recall that a Craig interpolant $\varphi$ between Init and Goal is
a state predicate that is implied by Goal, and unsatisfiable in conjunction with
Init. In this sense, $\varphi$ describes an observable effect that must occur if REACH
wins, and the concrete transition that instantiates the interpolant causes this
effect.


Proposition 6. Let $\varphi$ be a Craig interpolant for Init and Goal. Then the
transition predicate $\mathit{Instantiate}(\varphi, G)$ is a necessary subgoal.


Proof. As $\varphi$ is an interpolant, it holds that $\mathit{Goal} \implies \varphi$ is valid and $\mathit{Init} \land \varphi$
is unsatisfiable. Consider any play $\rho = s_0 s_1 \ldots$ of $G$ such that $\mathit{Goal}(s_n)$ is
valid for some $n \in \mathbb{N}$. It follows that $\neg\varphi(s_0)$ and $\varphi(s_n)$ are both valid.
Consequently, there is some $0 \le i < n$ such that $\neg\varphi(s_i)$ and $\varphi(s_{i+1})$ are
both valid. As all pairs $(s_k, s_{k+1})$ satisfy either Safe or Reach, it follows that
$(\mathit{Instantiate}(\varphi, G))(s_i, s_{i+1})$ is valid. Hence, $\mathit{Instantiate}(\varphi, G)$ is a necessary
subgoal.
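
Proposition 6 can be sanity-checked on an explicit finite game. In the sketch below, a state predicate is a set of states, so every set that contains all goal states and no initial state plays the role of an interpolant. The game and all helper names are hypothetical. The check relies on the observation that, when no initial state is a goal state, a transition set is a necessary subgoal exactly if removing it makes the goal unreachable.

```python
from collections import deque
from itertools import combinations

STATES = {0, 1, 2, 3, 4, 5}
INIT, GOAL = {0, 1}, {5}
TRANS = {(1, 3), (1, 0), (2, 4), (0, 1), (0, 2), (3, 3), (4, 5)}  # Safe or Reach


def instantiate(phi):
    """Transitions that lead from outside phi into phi."""
    return {(s, t) for (s, t) in TRANS if s not in phi and t in phi}


def goal_reachable(transitions):
    """Breadth-first search from INIT; True if some goal state is reached."""
    succ = {}
    for s, t in transitions:
        succ.setdefault(s, set()).add(t)
    frontier, seen = deque(INIT), set(INIT)
    while frontier:
        s = frontier.popleft()
        if s in GOAL:
            return True
        for t in succ.get(s, ()):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return False


def is_necessary_subgoal(C):
    """C is necessary iff the goal is unreachable without the transitions in C."""
    return not goal_reachable(TRANS - C)


def interpolant_candidates():
    """All state sets that contain GOAL and are disjoint from INIT."""
    free = sorted(STATES - INIT - GOAL)
    for k in range(len(free) + 1):
        for extra in combinations(free, k):
            yield GOAL | set(extra)
```

Every candidate produced by `interpolant_candidates()` yields a necessary subgoal, whereas an arbitrary transition set such as `{(0, 1)}` does not.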


While avoiding a necessary subgoal is a winning strategy for SAFE, reaching a 
necessary subgoal is in general not sufficient to guarantee a win for REACH. This 
is because there might be some transitions in the necessary subgoal that produce 
the desired effect described by the Craig interpolant, but that trap REACH in a
region of the state space where they cannot enforce some other necessary effect
to reach the goal. For the purpose of describing a set of transitions that is guaranteed
to be winning for the reachability player, we introduce sufficient subgoals.


Definition 7 (Sufficient subgoal). A transition predicate $F \in \mathcal{L}(V \cup V')$
is called a sufficient subgoal if REACH wins from every state satisfying
$\mathit{Post}(F)[V'/V]$.


Example 8. Consider the Mona Lisa game $G$ described in Sect. 2.

\[ C_1 \equiv (\mathit{Guard} \lor \mathit{Thief}) \land p \ne 1 \land p' = 1 \]

qualifies as sufficient subgoal, because REACH wins from every successor state as
all those states satisfy Goal. Also, every play reaching Goal eventually passes
$C_1$, and hence $C_1$ is also necessary. On the other hand,

\[ C_2 \equiv (\mathit{Guard} \lor \mathit{Thief}) \land a \ne 0 \land a' = 0 \]

is only a necessary subgoal in $G$, because SAFE wins from some (in fact all) states
satisfying $\mathit{Post}(C_2)$.


If the set of transitions in the necessary subgoal $C$ that lead to winning states
of REACH is definable in $\mathcal{L}$, then we call the transition predicate $F$ that defines it
the largest sufficient subgoal included in $C$. It is characterized by the properties
(1) $F \implies C$ is valid, and (2) if $F'$ is such that $F \implies F'$ is valid, then either
$F \equiv F'$, or $F'$ is not a sufficient subgoal. Since $C$ is a necessary subgoal and
$F$ is maximal with the properties above, REACH needs to see a transition in $F$
eventually in order to win. This balance of necessity and sufficiency allows us to
partition the game along $F$ into a game that happens after the subgoal and one
that happens before.


Proposition 9. Let $C$ be a necessary subgoal, and $F$ be the largest sufficient
subgoal included in $C$. Then REACH wins from an initial state $s$ in $G$ if and only
if REACH wins from $s$ in the pre-game

\[ G_{\mathit{pre}} = (\mathit{Init},\ \mathit{Safe} \land \neg F,\ \mathit{Reach} \land \neg F,\ \mathit{Pre}(\mathit{Enf}(F, G))). \]


Proof. "$\Rightarrow$". Suppose that REACH wins in $G$ from $s$ using strategy $\sigma_R$. Assume
for a contradiction that SAFE wins in $G_{\mathit{pre}}$ from $s$ using strategy $\sigma_S$. Consider
strategy $\sigma'_S$ such that $\sigma'_S(\omega s') = \sigma_S(\omega s')$ if $(\mathit{Safe} \land \neg F)(s')$ is satisfiable, and
else $\sigma'_S(\omega s') = \sigma''_S(\omega s')$, where $\sigma''_S$ is an arbitrary safety player strategy in $G$. Let
$\rho = s_0 s_1 \ldots$ be the (unique) play of $G$ consistent with both $\sigma_R$ and $\sigma'_S$, where
$s_0 = s$. Since $\sigma_R$ is winning in $G$ and $C$ is a necessary subgoal in $G$, there must
exist some $m \in \mathbb{N}$ such that $C(s_m, s_{m+1})$ is valid. Let $m$ be the smallest such
index. Since $F \implies C$, we know for all $0 \le k < m$ that $\neg F(s_k, s_{k+1})$ holds.
Hence, there is the play $\rho' = s_0 s_1 \ldots s_m \ldots$ in $G_{\mathit{pre}}$ consistent with $\sigma_S$. The state
$s_{m+1}$ is winning for REACH in $G$, as it is reached on a play consistent with the
winning strategy $\sigma_R$. Hence, we know that $F(s_m, s_{m+1})$ holds, because $F$ is the
largest sufficient subgoal included in $C$. If $(\mathit{Reach} \land F)(s_m, s_{m+1})$ held, we would
have that $\mathit{Pre}(\mathit{Enf}(F, G))(s_m)$ holds: a contradiction with $\rho'$ being consistent with
$\sigma_S$, which we assumed to be winning in $G_{\mathit{pre}}$. It follows that $(\mathit{Safe} \land F)(s_m, s_{m+1})$
holds. We can conclude that $(\mathit{Safe} \land \neg F)(s_m)$ is unsatisfiable (i.e., $s_m$ is a trap
state in $G_{\mathit{pre}}$), because in all other cases SAFE plays according to $\sigma_S$, which cannot
choose a transition satisfying $F$. However, this implies that $\mathit{Pre}(\mathit{Enf}(F, G))(s_m)$
holds, again a contradiction with $\rho'$ being consistent with winning strategy $\sigma_S$.
"$\Leftarrow$". If REACH wins in $G_{\mathit{pre}}$ they have a strategy $\sigma_R$ such that every play
consistent with $\sigma_R$ reaches the set $\mathit{Pre}(\mathit{Enf}(F, G))$. As $F$ is a sufficient subgoal,
the states $\mathit{Post}(F)$ are winning for REACH by definition. It follows by Lemma 3
that all states satisfying $\mathit{Pre}(\mathit{Enf}(F, G))$ are winning in $G$. Combining $\sigma_R$ with a
strategy that wins in all these states yields a winning strategy for REACH in $G$.


5 Causality-Based Game Solving 


Proposition 9 in the preceding section foreshadows how subgoals can be employed
in building a recursive approach for the solution of reachability games. Before 
turning to our actual algorithm, we describe a way to symbolically represent 
nondeterministic memoryless strategies. As discussed in [18], there is no ideal 
strategy description language for the class of games we consider. Our approach 
allows us to describe sets of concrete strategies as defined in Sect. 3 with linear 
arithmetic formulas. This framework will prove convenient for strategy synthesis, 
i.e., the computation of winning strategies instead of simply determining the 
winner of the game. 


5.1 Symbolically Represented Strategies 


We will represent strategies for both players using transition predicates $\mathcal{S} \in
\mathcal{L}(V \cup V')$, henceforth called symbolic strategies, where we only require that
$(\mathcal{S} \Rightarrow (\mathit{Safe} \lor \mathit{Reach}))$ is valid. A sequence $s_0 \dots s_n \in S^+$ is called a play
prefix if it is a prefix of some play in $\mathcal{G}$, $(\neg\mathit{Goal})(s_j)$ holds for all $0 \le j \le n$,
and $s_n$ is not a trap state. We say that a play prefix $\rho = s_0 \dots s_n$ conforms to a
symbolic reachability strategy $\mathcal{S}$ if for all $j < n$ we have that $\mathcal{S}(s_j, s_{j+1})$ holds
whenever $s_j \in S_{\mathrm{REACH}}$ (and analogously for safety strategies). A play conforms
to $\mathcal{S}$ if all its play prefixes conform to $\mathcal{S}$. We say that $\mathcal{S}$ is winning for REACH in s
if all plays from s that conform to $\mathcal{S}$ are winning for REACH and all play prefixes
$s_0 \dots s_n \in S^* S_{\mathrm{REACH}}$ from s that conform to $\mathcal{S}$ are such that $(\mathcal{S} \land \mathit{Reach})(s_n)$
is satisfiable (and analogously for SAFE). The second condition ensures that the
player cannot be forced to play a transition outside of $\mathcal{S}$ by their opponent while
the play has not reached a trap state or Goal, and in particular guarantees the
existence of a concrete strategy (as defined in Sect. 3) conforming to $\mathcal{S}$.


Lemma 10. If REACH (SAFE) has a winning symbolic strategy in s, then REACH 
(SAFE) has a concrete winning strategy in s. 
Proof. Let $\mathcal{S}$ be a symbolic winning strategy for REACH. Let $\sigma_R$ be any
reachability strategy such that for all play prefixes $\omega s \in S^* S_{\mathrm{REACH}}$ that conform to
$\mathcal{S}$ the formula $\mathcal{S}(s, \sigma_R(\omega s))$ is valid. Such a function is guaranteed to exist, as
$(\mathcal{S} \land \mathit{Reach})(s)$ is satisfiable for all such play prefixes by definition. Furthermore,
$\sigma_R$ is winning as all play prefixes of plays consistent with $\sigma_R$ conform to $\mathcal{S}$, and
hence all these plays are winning by assumption. The proof for SAFE is analogous.


This representation allows us to specify nondeterministic strategies, but
classical memoryless strategies on finite arenas (specified as a function $\sigma \colon S_{\mathrm{REACH}} \to S$
or $S_{\mathrm{SAFE}} \to S$) can also be represented in this form using a disjunction over
formulas $\bigwedge_{v \in V} (v = s(v) \land v' = \sigma(s)(v))$ for varying $s \in S$.
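The two directions discussed above (picking a concrete strategy out of a symbolic one as in Lemma 10, and encoding a memoryless strategy as a transition predicate) can be sketched on an explicit game, where a symbolic strategy is simply a set of allowed transitions. The function names and the tie-breaking rule are assumptions made for illustration only.

```python
def concretize(symbolic, player_states):
    """Pick one allowed successor per controlled state (tie-break: smallest successor)."""
    choice = {}
    for s in player_states:
        allowed = sorted(t for (u, t) in symbolic if u == s)
        if allowed:
            choice[s] = allowed[0]
    return choice


def as_symbolic(sigma):
    """Encode a memoryless strategy sigma: state -> state as a set of transitions."""
    return {(s, t) for s, t in sigma.items()}


# Hypothetical nondeterministic strategy: from state 0 both 1 and 2 are allowed.
symbolic = {(0, 1), (0, 2), (3, 4)}
sigma = concretize(symbolic, player_states={0, 3, 5})
```

Any choice function of this kind conforms to the symbolic strategy by construction, which is the only property the proof of Lemma 10 relies on.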

The following lemma shows that a necessary subgoal directly yields a sym- 
bolic strategy for SAFE if the subgoal is, in a certain sense, locally avoidable by 
SAFE. It will be our main tool for synthesizing safety player strategies. 


Lemma 11. Let C be a necessary subgoal for $\mathcal{G}$ and suppose that
$\mathit{Unsat}(\mathit{Enf}(C, \mathcal{G}))$ holds. Then, $\mathit{Safe} \land \neg C$ is a winning symbolic strategy for
SAFE in $\mathcal{G}$.
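On an explicit game, the premise of Lemma 11 can be checked directly. The sketch below assumes the reading of Enf given in Remark 15 (a transition of C is enforceable if its source is controlled by REACH, or if all outgoing transitions of its SAFE-controlled source lie in C); the formal definition of Enf is given in Sect. 4, and the encoding and names here are illustrative.

```python
def enf(C, edges, reach_states):
    """Transitions of C that REACH can enforce (reading taken from Remark 15)."""
    out = {}
    for (s, t) in edges:
        out.setdefault(s, set()).add((s, t))
    return {(s, t) for (s, t) in C & edges
            if s in reach_states or out[s] <= C}


def safe_strategy_if_avoidable(C, edges, reach_states):
    """Lemma 11 on an explicit game: if Enf(C) is empty, SAFE plays Safe AND NOT C."""
    if enf(C, edges, reach_states):
        return None  # the lemma does not apply
    safe_edges = {(s, t) for (s, t) in edges if s not in reach_states}
    return safe_edges - C


# Hypothetical bottleneck: SAFE controls state 1 and may stay there or open the way to 2.
reach_states = {0, 2}
edges = {(0, 1), (1, 1), (1, 2), (2, 3)}
door = {(1, 2)}    # necessary subgoal that SAFE can avoid locally
last = {(2, 3)}    # subgoal whose source is controlled by REACH
```

Here `safe_strategy_if_avoidable(door, ...)` keeps only the self-loop at state 1, while the call with `last` returns `None` because REACH can enforce that transition.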


5.2 A Recursive Algorithm 


We now describe our algorithm which utilizes necessary subgoals to decompose
and solve two-player reachability games (Algorithm 1). It is incomplete in the
sense that it does not return on every input (Sect. 5.3 discusses special cases
with guaranteed termination). If the algorithm returns on input $\mathcal{G}$, it returns
a triple $(R, \mathcal{S}_{\mathrm{REACH}}, \mathcal{S}_{\mathrm{SAFE}})$, where (1) R is a state predicate characterizing the
initial states that are winning for REACH in $\mathcal{G}$, (2) $\mathcal{S}_{\mathrm{REACH}}$ is a symbolic strategy
for REACH that wins in all initial states satisfying R, and (3) $\mathcal{S}_{\mathrm{SAFE}}$ is a symbolic
strategy for SAFE that wins in all initial states satisfying $\mathit{Init} \land \neg R$. The returned
safety strategy $\mathcal{S}_{\mathrm{SAFE}}$ is such that $\neg\mathcal{S}_{\mathrm{SAFE}}$ is a necessary subgoal that SAFE can
avoid locally in the game $\mathcal{G}$ restricted to initial states $\mathit{Init} \land \neg R$ (see Lemma 11).

Algorithm 1 works as follows. States satisfying Init and Goal are immediately
winning for REACH and thus always part of the returned formula R. Following
the discussion at the beginning of Sect. 4, further analysis considers the game
starting in the remaining initial states $I = \mathit{Init} \land \neg\mathit{Goal}$. If there is no such
state, we may return that all initial states are winning (line 5). Here, REACH
wins from R without playing any move, and hence $\mathcal{S}_{\mathrm{REACH}} = \mathit{false}$ is a valid
winning symbolic strategy (winning symbolic strategies are only required to
provide moves in prefixes that have not seen Goal so far). We may choose $\mathcal{S}_{\mathrm{SAFE}}$
arbitrarily as there is no initial state winning for SAFE.

If the algorithm does not return in line 5, a necessary subgoal C between I and
Goal is computed by instantiating a Craig interpolant $\varphi$ for the two predicates
(lines 6 and 7, see also Proposition 6). We break up the remaining description of
the algorithm into three parts, which correspond to the main cases that occur 
when splitting the game along the subgoal C. 
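Algorithm 1: Reach($\mathcal{G}$)
In : reachability game $\mathcal{G} = (\mathit{Init}, \mathit{Safe}, \mathit{Reach}, \mathit{Goal})$
Out: triple $(R, \mathcal{S}_{\mathrm{REACH}}, \mathcal{S}_{\mathrm{SAFE}})$ s.t.
  - $R \in \mathcal{L}(V)$ represents the set of initial states winning for REACH;
  - $\mathcal{S}_{\mathrm{REACH}}$ is a winning symbolic reachability strategy for states in R;
  - $\mathcal{S}_{\mathrm{SAFE}}$ is a winning symbolic safety strategy for states in $\mathit{Init} \land \neg R$.

 1  begin
 2    $R \leftarrow \mathit{Init} \land \mathit{Goal}$
 3    $I \leftarrow \mathit{Init} \land \neg\mathit{Goal}$
 4    if $\mathit{Unsat}(I)$ then
 5      return $R$, $\mathit{false}$, $\mathit{false}$
 6    $\varphi \leftarrow \mathit{Interpolate}(I, \mathit{Goal})$
 7    $C \leftarrow \mathit{Instantiate}(\varphi, \mathcal{G})$
 8    if $\mathit{Unsat}(\mathit{Enf}(C, \mathcal{G}))$ then
 9      return $R$, $\mathit{false}$, $\mathit{Safe} \land \neg C$
10    $\mathcal{G}_{post} \leftarrow (\mathit{Post}(C)[V'/V],\ \mathit{Safe} \land \varphi,\ \mathit{Reach} \land \varphi,\ \mathit{Goal})$
11    $R_{post}, \mathcal{S}^{post}_{\mathrm{REACH}}, \mathcal{S}^{post}_{\mathrm{SAFE}} \leftarrow \mathit{Reach}(\mathcal{G}_{post})$
12    $F \leftarrow C \land R_{post}[V/V']$
13    if $\mathit{Unsat}(\mathit{Enf}(F, \mathcal{G}))$ then
14      return $R$, $\mathit{false}$, $\mathit{Safe} \land \neg F \land (\varphi \Rightarrow \mathcal{S}^{post}_{\mathrm{SAFE}})$
15    if $\mathit{Sat}((\mathit{Reach} \lor \mathit{Safe}) \land \varphi \land \neg\varphi' \land \neg\mathit{Goal})$ then
16      $F \leftarrow F \lor \mathit{Goal}[V/V']$
17      $\varphi \leftarrow \mathit{false}$
18    $\mathcal{G}_{pre} \leftarrow (I,\ \mathit{Safe} \land \neg F,\ \mathit{Reach} \land \neg F,\ \mathit{Pre}(\mathit{Enf}(F, \mathcal{G})))$
19    $R_{pre}, \mathcal{S}^{pre}_{\mathrm{REACH}}, \mathcal{S}^{pre}_{\mathrm{SAFE}} \leftarrow \mathit{Reach}(\mathcal{G}_{pre})$
20    return $R \lor R_{pre}$,
21           $\mathit{combine}(\mathcal{S}^{post}_{\mathrm{REACH}}, F, \mathcal{S}^{pre}_{\mathrm{REACH}})$,
22           $(\neg\varphi \Rightarrow \mathcal{S}^{pre}_{\mathrm{SAFE}}) \land (\varphi \Rightarrow \mathcal{S}^{post}_{\mathrm{SAFE}})$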


Case 1: SAFE can avoid the subgoal C. If the necessary subgoal C qualifies
for Lemma 11, we can immediately conclude that SAFE is winning for all states
satisfying I (lines 8 and 9). An instance of this case occurs if the interpolant
describes a bottleneck in the game which is fully controlled by SAFE. The winning
symbolic safety strategy is $\mathit{Safe} \land \neg C$ in this case (line 9), and we will
assume that safety strategies returned by recursive calls of the algorithm are
essentially negations of necessary subgoals that can be avoided by SAFE.

If Lemma 11 is not applicable, we next find those transitions in C that move 
into a winning state for the safety player. This is achieved by analyzing the 
post-game (line 10): 
$\mathcal{G}_{post} = (\mathit{Post}(C)[V'/V],\ \mathit{Safe} \land \varphi,\ \mathit{Reach} \land \varphi,\ \mathit{Goal})$.


Its initial states are exactly the states one sees after bridging the subgoal C. In
order to make sure that $\mathcal{G}_{post}$ is, in some sense, easier to solve than $\mathcal{G}$, we restrict
both Safe and Reach to $\varphi$, which is the interpolant used to compute the subgoal
C. This has the effect of removing all transitions in states not satisfying $\varphi$,
making them trap states. For the safety player this makes $\mathcal{G}_{post}$ easier to win
than $\mathcal{G}$ as all plays ending in such a trap state without seeing Goal before are
winning for SAFE in $\mathcal{G}_{post}$. Hence we formally have:


Lemma 12. If $\mathcal{S}$ is a winning symbolic reachability strategy from s in $\mathcal{G}_{post}$,
then $\mathcal{S}$ is also winning from s in $\mathcal{G}$.
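The construction of the post-game (line 10) is easy to mirror on an explicit game: the new initial states are the targets of C, and every transition whose source violates the interpolant is dropped, which turns those sources into trap states. The encoding below (sets of state pairs for Safe and Reach, a set of states for the interpolant) is an assumption made for illustration.

```python
def post_game(C, safe, reach, goal, phi):
    """Explicit-state version of line 10: start after C and stay inside phi."""
    init = {t for (_, t) in C}                              # Post(C)[V'/V]
    safe_phi = {(s, t) for (s, t) in safe if s in phi}      # Safe AND phi
    reach_phi = {(s, t) for (s, t) in reach if s in phi}    # Reach AND phi
    return init, safe_phi, reach_phi, goal


# Hypothetical game: phi = {2, 3, 4} separates the initial state 0 from Goal = {4}.
safe = {(1, 2), (3, 1), (3, 4)}
reach = {(0, 1), (2, 3)}
goal = {4}
phi = {2, 3, 4}
C = {(s, t) for (s, t) in safe | reach if s not in phi and t in phi}

init, safe_phi, reach_phi, _ = post_game(C, safe, reach, goal, phi)
```

In the result, state 1 has no outgoing transition any more, so a play that SAFE steers back from 3 to 1 ends in a trap state and is won by SAFE in the post-game.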


Due to the restriction to $\varphi$, intuitively REACH wins from a state s in $\mathcal{G}_{post}$ if
they can win from s in $\mathcal{G}$ while staying inside the interpolant $\varphi$. In other words,
REACH must guarantee that the necessary subgoal C is not visited again in the
play. Still, the set $R_{post}$, as returned in line 11 by the recursive call to Algorithm 1
on $\mathcal{G}_{post}$, is a sufficient subgoal in $\mathcal{G}$, by the above lemma. Furthermore, if SAFE
can avoid all states satisfying $R_{post}$ (see line 13), then this also implies a winning
strategy from all initial states in I. The reason is that REACH can only win by
eventually visiting a state from which they can win without leaving $\varphi$ again, as
$(\mathit{Goal} \Rightarrow \varphi)$ is valid. This is not possible if SAFE can avoid all states in $R_{post}$.

In this case we construct $\mathcal{S}_{\mathrm{SAFE}}$ as follows. We assume that $\neg\mathcal{S}^{post}_{\mathrm{SAFE}}$ is a
necessary subgoal that can be locally avoided in $\mathcal{G}_{post}$ from all states satisfying
$\mathit{Post}(C)[V'/V] \land \neg R_{post}$, and furthermore, we know that $F := C \land R_{post}[V/V']$
can be locally avoided in $\mathcal{G}$ (line 13). Intuitively, playing according to $\mathcal{S}^{post}_{\mathrm{SAFE}}$ in
$\mathcal{G}_{post}$ yields a strategy for SAFE which avoids Goal and may move back into a
state satisfying $\neg\varphi$, which forces REACH to bridge the subgoal C again in order
to win. It follows that $F \lor (\varphi \land \neg\mathcal{S}^{post}_{\mathrm{SAFE}})$ is a necessary subgoal from I that
can be locally avoided by SAFE in $\mathcal{G}$, and the corresponding symbolic strategy
is $\mathit{Safe} \land \neg F \land (\varphi \Rightarrow \mathcal{S}^{post}_{\mathrm{SAFE}})$ (we additionally intersect the negated
necessary subgoal with Safe to ensure that the symbolic strategy only includes legal
transitions).

So far, the subgoal was such that SAFE could avoid it entirely, or at least 
avoid all states from which REACH would win when forced to remain inside the 
post-game. If this is not the case, then we also need to consider the pre-game 
(line 18): 

$\mathcal{G}_{pre} = (I,\ \mathit{Safe} \land \neg F,\ \mathit{Reach} \land \neg F,\ \mathit{Pre}(\mathit{Enf}(F, \mathcal{G})))$,


which intuitively describes the game before bridging the interpolant C for the 
last time. The exact definition of F will depend on whether C perfectly partitions 
the game or not. In both cases F will be the largest sufficient subgoal contained 
in a necessary subgoal, which lets us apply Proposition 9 to conclude that the 
initial winning regions of G and Gpre coincide. 


Case 2: The subgoal perfectly partitions $\mathcal{G}$. We say that $\varphi$ perfectly
partitions $\mathcal{G}$ if $(\mathit{Reach} \lor \mathit{Safe}) \land \varphi \land \neg\varphi' \land \neg\mathit{Goal}$ is unsatisfiable (cf. line 15). Intuitively,
this means that there is no transition that "undoes" the effect of the subgoal C. If
this holds, then the restriction of $\mathcal{G}_{post}$ to states satisfying $\varphi$ is de facto no longer
a restriction, as no play can reach a state violating $\varphi$ anyway after passing through the
subgoal. This intuition is formalized by the following lemma.


Lemma 13. Assume that $\varphi$ perfectly partitions $\mathcal{G}$, and let s be a state satisfying
$\mathit{Post}(C)[V'/V]$. Then REACH wins from s in $\mathcal{G}_{post}$ if and only if REACH wins from
s in $\mathcal{G}$.
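The test in line 15 has a direct explicit-state counterpart: the interpolant perfectly partitions the game if no transition leaves it from a non-goal state. The following sketch uses sets of state pairs for Safe and Reach and a set of states for the interpolant; this encoding and the function name are illustrative assumptions.

```python
def perfectly_partitions(safe, reach, goal, phi):
    """True iff (Reach OR Safe) AND phi AND NOT phi' AND NOT Goal has no witness."""
    return not any(s in phi and t not in phi and s not in goal
                   for (s, t) in safe | reach)


# Hypothetical game with interpolant phi = {2, 3, 4} and Goal = {4}.
reach = {(0, 1), (2, 3)}
goal = {4}
phi = {2, 3, 4}
safe_with_exit = {(1, 2), (3, 1), (3, 4)}        # (3, 1) leaves phi again
safe_no_exit = {(1, 2), (3, 4), (4, 0)}          # only the goal state leaves phi
```

With `safe_with_exit` the transition (3, 1) "undoes" the subgoal, so the check fails and Case 3 applies; with `safe_no_exit` the only transition leaving the interpolant starts in Goal and is ignored.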


It follows that $F = C \land R_{post}[V/V']$ is the largest sufficient subgoal included
in C. By Proposition 9, the same initial states are winning for REACH in $\mathcal{G}_{pre}$
and in $\mathcal{G}$. In this case, we construct the desired safety strategy (line 22) as

$\mathcal{S}_{\mathrm{SAFE}} = (\neg\varphi \Rightarrow \mathcal{S}^{pre}_{\mathrm{SAFE}}) \land (\varphi \Rightarrow \mathcal{S}^{post}_{\mathrm{SAFE}})$,

where $\neg\mathcal{S}^{pre}_{\mathrm{SAFE}}$ and $\neg\mathcal{S}^{post}_{\mathrm{SAFE}}$ are assumed to be necessary subgoals avoidable by SAFE in the
corresponding subgames. Intuitively, the combined strategy consists of following
$\mathcal{S}^{pre}_{\mathrm{SAFE}}$ as long as one remains in the pre-game, which, by induction hypothesis,
allows SAFE to avoid all transitions from F if starting in $\neg R_{pre}$. If the play crosses
$C \land \neg F$, the strategy is to play according to the winning strategy of the
post-game.
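A symbolic strategy for REACH can be given by combining pre- and
post-strategies as follows (line 21):

$$\begin{aligned}
\mathit{combine}(\mathcal{S}^{post}_{\mathrm{REACH}}, F, \mathcal{S}^{pre}_{\mathrm{REACH}}) ={} & (\mathit{Pre}(\mathcal{S}^{post}_{\mathrm{REACH}}) \Rightarrow \mathcal{S}^{post}_{\mathrm{REACH}}) \\
\land\ & ((\neg\mathit{Pre}(\mathcal{S}^{post}_{\mathrm{REACH}}) \land \mathit{Pre}(F)) \Rightarrow F) \\
\land\ & ((\neg\mathit{Pre}(\mathcal{S}^{post}_{\mathrm{REACH}}) \land \neg\mathit{Pre}(F)) \Rightarrow \mathcal{S}^{pre}_{\mathrm{REACH}}) \\
\land\ & (\mathit{Pre}(\mathcal{S}^{post}_{\mathrm{REACH}}) \lor \mathit{Pre}(F) \lor \mathit{Pre}(\mathcal{S}^{pre}_{\mathrm{REACH}})).
\end{aligned}$$

This represents a nested conditional strategy that prefers the strategies of the
subgames in the priority order $\mathcal{S}^{post}_{\mathrm{REACH}}$, F, and finally $\mathcal{S}^{pre}_{\mathrm{REACH}}$. The reason for this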
order is that the winning condition in the post-game coincides with the global 
winning objective (to reach Goal), while in the pre-game REACH tries to reach a 
winning state in the post-game. The set F is exactly the bridge between these 
two. The last condition makes sure that the strategy only includes transitions of 
states in which it is winning. 
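Read as a nested conditional on the source state, the combined reachability strategy can be sketched for explicitly represented strategies (sets of allowed transitions). The encoding and the example values are assumptions for illustration; only the priority order post-game, F, pre-game is taken from the text.

```python
def pre(T):
    """Source states of a set of transitions."""
    return {s for (s, _) in T}


def combine(s_post, F, s_pre):
    """Nested conditional strategy with priority order post-game, F, pre-game."""
    result = set()
    for (s, t) in s_post | F | s_pre:
        if s in pre(s_post):
            keep = (s, t) in s_post
        elif s in pre(F):
            keep = (s, t) in F
        else:
            keep = (s, t) in s_pre
        if keep:
            result.add((s, t))
    return result


# Hypothetical component strategies.
s_post = {(2, 3)}
F = {(1, 2), (2, 5)}
s_pre = {(0, 1), (1, 0), (2, 0)}
combined = combine(s_post, F, s_pre)
```

Iterating only over transitions that occur in one of the three components takes care of the last conjunct, since every other source state is excluded.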


Case 3: The subgoal does not perfectly partition $\mathcal{G}$. If
$\mathit{Sat}((\mathit{Reach} \lor \mathit{Safe}) \land \varphi \land \neg\varphi' \land \neg\mathit{Goal})$ is true in line 15, we can no longer assume that F is
the largest sufficient subgoal in C. The reason is that SAFE may win in $\mathcal{G}_{post}$ by
moving out of the subgame, but if this move leads to a winning state for REACH in
$\mathcal{G}$, then such a strategy is winning in $\mathcal{G}_{post}$, but not in $\mathcal{G}$. So we can only assume
that F is sufficient (this follows by Lemma 12). In order to apply Proposition 9
we extend F by all transitions that move directly into Goal (line 16). This
immediately yields a necessary and sufficient subgoal, and so again Proposition 9
applies to $\mathcal{G}_{pre}$ (line 18). We could have also added Goal-states to F in Case 2,
but we have observed that not doing so improves the performance of our
procedure considerably.

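The reachability strategy is composed of $\mathcal{S}^{post}_{\mathrm{REACH}}$, F, and $\mathcal{S}^{pre}_{\mathrm{REACH}}$ exactly as in
Case 2 (line 21). As all transitions in F are losing for SAFE, and these are the only
ones that are removed in $\mathcal{G}_{pre}$, essentially SAFE can play using the same strategies
in $\mathcal{G}$ and $\mathcal{G}_{pre}$. We implement this by setting $\varphi$ to false (line 17), in which case
$\mathcal{S}_{\mathrm{SAFE}}$ (line 22) equals $(\mathit{true} \Rightarrow \mathcal{S}^{pre}_{\mathrm{SAFE}}) \land (\mathit{false} \Rightarrow \mathcal{S}^{post}_{\mathrm{SAFE}}) \equiv \mathcal{S}^{pre}_{\mathrm{SAFE}}$.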
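The composition of the safety strategy in line 22, and its collapse to the pre-game strategy when the interpolant is set to false, can be replayed on explicit sets. In this sketch the interpolant is a set of states evaluated on the source of a transition; the encoding and example values are assumptions for illustration.

```python
def compose_safe(phi, s_pre, s_post):
    """(NOT phi => s_pre) AND (phi => s_post), evaluated on the source state."""
    return {(s, t) for (s, t) in s_pre | s_post
            if ((s, t) in s_post if s in phi else (s, t) in s_pre)}


# Hypothetical component strategies and interpolant.
s_pre = {(0, 0), (1, 0), (2, 9)}
s_post = {(2, 2), (3, 2), (0, 7)}
phi = {2, 3}

case2 = compose_safe(phi, s_pre, s_post)     # Case 2: follow s_post inside phi
case3 = compose_safe(set(), s_pre, s_post)   # Case 3: phi = false
```

With the empty set standing in for `false`, the second conjunct is vacuous and the result is exactly `s_pre`.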

Finally, we formally state the partial correctness of the algorithm, using the 
ideas from above. 


Theorem 14 (Partial correctness). If Reach($\mathcal{G}$) returns $(R, \mathcal{S}_{\mathrm{REACH}}, \mathcal{S}_{\mathrm{SAFE}})$,
then

- R characterizes the set of initial states that are winning for REACH in $\mathcal{G}$,
- $\mathcal{S}_{\mathrm{REACH}}$ is a winning symbolic reachability strategy from R,
- $\mathcal{S}_{\mathrm{SAFE}}$ is a winning symbolic safety strategy from $\mathit{Init} \land \neg R$.


Remark 15 (Simulating the attractor). Note that Craig interpolants are 
by no means unique. If we choose the interpolation function so that 
Interpolate(I, Goal) always returns Goal (this is a valid interpolant), Algo- 
rithm 1 essentially simulates the attractor. In this case the subgoal C (line 7) 
contains exactly the transitions that move directly into Goal. All states in 
Post(C)[V’/V] are then winning for REACH and hence Rpost would be equiva- 
lent to Post(C)[V’/V], which implies that C = F holds in this case. The new 
goal states in Gpre are set to Pre(Enf(F,G)), which are exactly the states in 
Pre(C) that either are controlled by REACH, or such that all their transitions are 
included in F. Hence the set Pre(Enf(F, G)) is exactly the classical controlled
predecessor. 
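The observation of Remark 15 can be replayed on a finite explicit game. The sketch below specializes the recursion of Algorithm 1 to the interpolant Goal, in which case the post-game is solved trivially and F equals C, and compares the result with the classical attractor. The explicit encoding, the function names and the test games are assumptions made for illustration; strategies are omitted.

```python
def cpre(target, edges, reach_states):
    """Classical controlled predecessor of `target` for REACH."""
    out = {}
    for (s, t) in edges:
        out.setdefault(s, set()).add(t)
    return {s for s, ts in out.items()
            if (s in reach_states and ts & target)
            or (s not in reach_states and ts <= target)}


def attractor(edges, reach_states, goal):
    """Least fixpoint of the controlled predecessor, starting from Goal."""
    win = set(goal)
    while True:
        new = cpre(win, edges, reach_states) - win
        if not new:
            return win
        win |= new


def reach_goal_interpolant(init, edges, reach_states, goal):
    """Algorithm 1 with Interpolate(I, Goal) = Goal; returns winning initial states."""
    R, I = init & goal, init - goal
    if not I:
        return R
    # C = F: all transitions that move from outside Goal directly into Goal.
    F = {(s, t) for (s, t) in edges if s not in goal and t in goal}
    # Pre(Enf(F, G)): the controlled predecessor among non-goal states.
    new_goal = cpre(goal, {e for e in edges if e[0] not in goal}, reach_states)
    if not new_goal:  # Unsat(Enf(F, G)): SAFE can avoid the subgoal
        return R
    # Pre-game: same initial states, transitions of F removed, new goal.
    return R | reach_goal_interpolant(I, edges - F, reach_states, new_goal)


# Hypothetical games: REACH controls 0 and 2, Goal = {4}.
reach_states, goal = {0, 2}, {4}
game_a = {(0, 1), (1, 2), (1, 3), (2, 4), (3, 3)}   # SAFE can escape from 1 to 3
game_b = {(0, 1), (1, 2), (2, 4), (3, 3)}           # SAFE is forced from 1 to 2
```

Each recursive call removes at least one transition, so the recursion terminates on finite games, in line with Theorem 16.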


One effect of slicing the game along general subgoals is that the initial pred- 
icate of the post-game (which describes all states satisfying the post-condition 
of the subgoal) may be satisfied by many states that do not necessarily need 
to be considered in order to decide who wins from the initial states of G (for 
example, because they are not reachable from any initial state, or cannot reach 
Goal). This can be a drawback if the (superfluous) size of the subgames makes 
them hard to solve. Notably, this is in general less of an issue for approaches 
based on unrolling of the transition relation: The method of solving increasingly 
large step-bounded games [18] will only consider states that are reachable from 
Init, while backwards fixpoint computations will not explore states that do not 
reach Goal. A way of coping with this is to provide additional information on the 
domains of variables, whenever this is available (we discuss the effect of bounding 
variable domains in Sect. 6). Indeed, in the case where all variable domains are 
finite, Algorithm 1 is guaranteed to terminate, as shown in the next subsection. 


5.3 Special Cases with Guaranteed Termination 


Deciding the winner in the types of games we consider is generally
undecidable (see [18] for the case that $\mathcal{L}$ is linear real arithmetic). Since Algorithm 1
returns a correct result whenever it terminates, this implies that it cannot always
terminate. In this section, we give two important cases in which we can prove
termination.


Theorem 16. If the domains of all variables in G are finite, then Reach(G) 
terminates. 


Remark 17 (Time complexity). The termination argument in the proof yields 
a single-exponential upper bound on the runtime of the algorithm, where the 
input size is measured in the number of concrete transitions of the game. This 
is because in both recursive calls the subgames may be “almost” as large as the 
input — they are only guaranteed to have at least one concrete transition less. 


We now show that, under certain assumptions, our algorithm also terminates 
for games that have a finite bisimulation quotient. To this end, we first clarify 
what bisimilarity means in our setting. A relation $R \subseteq S \times S$ over the states of
$\mathcal{G}$ is called a bisimulation on $\mathcal{G}$, if


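- for all $(s_1, s_2) \in R$ the formulas $\mathit{Goal}(s_1) \Leftrightarrow \mathit{Goal}(s_2)$, $\mathit{Init}(s_1) \Leftrightarrow
  \mathit{Init}(s_2)$ and $r(s_1) \Leftrightarrow r(s_2)$ are valid (recall that r holds exactly in states
  controlled by REACH).

- for all $(s_1, s_2) \in R$ and $s'_1 \in S$ such that $(\mathit{Safe} \lor \mathit{Reach})(s_1, s'_1)$ holds, there
  exists $s'_2 \in S$ such that $(\mathit{Safe} \lor \mathit{Reach})(s_2, s'_2)$ holds, and $(s'_1, s'_2) \in R$.

- for all $(s_1, s_2) \in R$ and $s'_2 \in S$ such that $(\mathit{Safe} \lor \mathit{Reach})(s_2, s'_2)$ holds, there
  exists $s'_1 \in S$ such that $(\mathit{Safe} \lor \mathit{Reach})(s_1, s'_1)$ holds, and $(s'_1, s'_2) \in R$.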


We say that $s_1$ and $s_2$ are bisimilar (denoted by $s_1 \sim s_2$) if there exists a
bisimulation R such that $(s_1, s_2) \in R$. Bisimilarity is an equivalence relation, and
it is the coarsest bisimulation on $\mathcal{G}$. The equivalence classes are called
bisimulation classes. As the winning region of any player can be expressed in the
$\mu$-calculus [39] and the $\mu$-calculus is invariant under bisimulation [9], it follows
that bisimilar states are won by the same player.


Lemma 18. Let R be a bisimulation on $\mathcal{G}$. If $(s_1, s_2) \in R$, then REACH wins
from $s_1$ in $\mathcal{G}$ if and only if REACH wins from $s_2$ in $\mathcal{G}$.
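For a finite explicit game, the bisimulation classes defined above can be computed by naive partition refinement: start from the partition induced by Goal, Init and the controlling player, and split blocks until every state in a block reaches the same set of blocks in one step. The sketch is a standard textbook procedure written for illustration; the encoding and the example game are assumptions.

```python
def bisimulation_classes(states, edges, goal, init, reach_states):
    """Coarsest bisimulation of a finite game, as a sorted list of sorted classes."""
    succ = {s: {t for (u, t) in edges if u == s} for s in states}
    # Initial partition: states agreeing on Goal, Init and the controlling player.
    block = {s: (s in goal, s in init, s in reach_states) for s in states}
    while True:
        # Signature: own block plus the set of blocks reachable in one step.
        sig = {s: (block[s], frozenset(block[t] for t in succ[s])) for s in states}
        if len(set(sig.values())) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = sig
    classes = {}
    for s in states:
        classes.setdefault(block[s], set()).add(s)
    return sorted(sorted(c) for c in classes.values())


# Hypothetical game: states 1 and 2 are symmetric, state 4 loops forever.
states = {0, 1, 2, 3, 4}
edges = {(0, 1), (0, 2), (1, 3), (2, 3), (4, 4)}
classes = bisimulation_classes(states, edges, goal={3}, init={0}, reach_states={0})
```

States 1 and 2 end up in the same class, so by Lemma 18 they are won by the same player, whereas state 4 is separated from them because it can never move towards Goal.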


We will assume that for each bisimulation class $S_i$ there exists a formula
$\psi_i \in \mathcal{L}(V)$ that defines $S_i$, formally: For all $s \in S$, $\psi_i(s)$ holds if and only if
$s \in S_i$. Furthermore, we will assume that the interpolation procedure respects $\sim$,
formally: $\mathit{Interpolate}(\varphi, \psi)$ is equivalent to a disjunction of formulas $\psi_i$. Such an
interpolant exists if $\varphi$ or $\psi$ already satisfy this assumption.


Theorem 19. Let G be a reachability game with finite bisimulation quotient 
under ~ and assume that all bisimulation classes of G are definable in L. Fur- 
thermore, assume that Interpolate respects ~. Then, Reach(G) terminates. 
6 Case Studies


In this section we evaluate our approach on a number of case studies. Our
prototype CABPY² is written in Python and implements the game solving part of
the presented algorithm. Extending it to return a symbolic strategy using the
ideas outlined above is straightforward. We compared our prototype with
SIMSYNTH [18], the only other readily available tool for solving linear arithmetic
games. The evaluation was carried out on Ubuntu 20.04 with a 4-core Intel®
Core™ i5 2.30 GHz processor and 8 GB of memory. CABPY uses the
PySMT [19] library as an interface to the MathSAT5 [12] and Z3 [30] SMT
solvers. On all benchmarks, the timeout was set to 10 min. In addition to the
winner, we report the runtime and the number of subgames our algorithm visits.
Both may vary with different SMT solvers or in different environments.


6.1 Game of Nim 


Game of Nim is a classic game from the literature [8] and is played on a number of
heaps of stones. The players take turns choosing a single heap and removing
at least one stone from it. We consider the version where the player that removes
the last stone wins. Our results are shown in Fig. 1. In instances with three heaps
or more we bounded the domains of the variables in the instance description, by
specifying that no heap exceeds its initial size or goes below zero.

Following the discussion in Sect. 5.3, we need to bound the domains to ensure 
the termination of our tool on these instances. Remarkably, bounding the vari- 
ables was not necessary for instances with only two heaps, where our tool CABPY 
scales to considerably larger instances than SIMSYNTH. We did not add the same 
constraints to the input of SIMSYNTH, as for SIMSYNTH this resulted in longer 
runtimes rather than shorter. In Game of Nim, there are no natural necessary 
subgoals that the safety player can locally control. 

The results (see Fig. 1) demonstrate that our approach is not completely 
dependent on finding the right interpolants and is in particular also competitive 
when the reachability player wins the game. We suspect that SIMSYNTH performs 
worse in these cases because the safety player has a large range of possible moves 
in most states, and inferring the win of the reachability player requires the tool 
to backtrack and try out all of them.
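The winners reported for the Nim instances can be cross-checked against Bouton's theorem [8]: under the rule that taking the last stone wins, the player to move wins exactly when the bitwise XOR of the heap sizes is nonzero. The sketch below verifies this by brute force on small instances. Which of the two players moves first is not stated in this section; the reported winners are consistent with the assumption, made here for illustration only, that SAFE moves first.

```python
from functools import lru_cache, reduce
from operator import xor


@lru_cache(maxsize=None)
def mover_wins(heaps):
    """True iff the player to move can force taking the last stone."""
    return any(
        not mover_wins(heaps[:i] + (h - k,) + heaps[i + 1:])
        for i, h in enumerate(heaps)
        for k in range(1, h + 1)
    )


def nim_sum(heaps):
    """Bitwise XOR of all heap sizes."""
    return reduce(xor, heaps, 0)


def predicted_winner(heaps):
    """Assumption: SAFE moves first, so REACH wins exactly when the nim-sum is 0."""
    return "SAFE" if nim_sum(heaps) != 0 else "REACH"
```

For example, `predicted_winner((1, 4, 5))` yields `"REACH"` and `predicted_winner((3, 3, 3))` yields `"SAFE"`, matching the corresponding rows of Fig. 1.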


² The source code of CABPY and our experimental data are both available at
https://github.com/reactive-systems/cabpy. We provide a virtual machine image
with CABPY already installed for reproducing our evaluation [35].


Heaps      | CABPY Subgames | CABPY Time (s) | SIMSYNTH Time (s) | Winner
(4,4)      | 19             | 1.50           | 10.44             | REACH
(4,5)      | 23             | 1.92           | 12.74             | SAFE
(5,5)      | 23             | 1.99           | 85.75             | REACH
(5,6)      | 27             | 2.90           | 91.66             | SAFE
(6,6)      | 28             | 3.04           | Timeout           | REACH
(6,7)      | 31             | 3.76           | Timeout           | SAFE
(20,20)    | 88             | 94.85          | Timeout           | REACH
(20,21)    | 94             | 113.04         | Timeout           | SAFE
(30,30)    | 128            | 364.13         | Timeout           | REACH
(30,31)    | 135            | 404.02         | Timeout           | SAFE
(3,3,3)b   | 23             | 13.63          | 2.85              | SAFE
(1,4,5)b   | 32             | 7.00           | 289.85            | REACH
(4,4,4)b   | 33             | 50.55          | 24.39             | SAFE
(2,4,6)b   | 38             | 19.77          | Timeout           | REACH
(5,5,5)b   | 33             | 127.89         | 162.50            | SAFE
(3,5,6)b   | 40             | 86.56          | Timeout           | REACH
(2,2,2,2)b | 39             | 84.79          | 213.79            | REACH
(2,2,2,3)b | 41             | 102.01         | Timeout           | SAFE

Fig. 1. Experimental results for the Game of Nim. The notation $(h_1, \dots, h_n)$ denotes
the instance played on n heaps, each of which consists of $h_i$ stones. Instances marked
with b indicate that the variable domains were explicitly bounded in the input for
CABPY.


6.2 Corridor


We now consider an example that demonstrates the potential of our method in
case the game structure contains natural bottlenecks. Consider a corridor of 100
rooms arranged in sequence, i.e., each room i with $0 \le i < 100$ is connected
to room $i + 1$ with a door. The objective of the reachability player is to reach
room 100 and they are free to choose valid values from $\mathbb{R}^2$ for the position in
each room at every other turn. The safety player controls some door to a room
$r \le 100$. Naturally, a winning strategy is to prevent the reachability player from
passing that door, which is a natural bottleneck and necessary subgoal on the
way to the last room.


r   | CABPY Subgames | CABPY Time (s) | SIMSYNTH Time (s) | Winner
10  | 10             | 0.57           | 3.93              | SAFE
20  | 20             | 1.23           | 20.48             | SAFE
40  | 40             | 3.42           | 121.96            | SAFE
60  | 60             | 7.36           | Timeout           | SAFE
80  | 80             | 17.72          | Timeout           | SAFE
100 | 100            | 26.36          | Timeout           | SAFE

Fig. 2. Experimental results for the Corridor game. The safety player controls the door
between rooms r - 1 and r.

The experimental results are summarized in Fig. 2. We evaluated several ver- 
sions of this game, increasing the length from the start to the controlled door. 
The results confirm that our causal synthesis algorithm quickly finds the trivial
strategy of closing the door. This is because Craig interpolation focuses the
subgoals on the room number variable while ignoring the movement in the rooms
in between, as can be seen by the number of considered subgames. SIMSYNTH, 
which tries to generalize a strategy obtained from a step-bounded game, strug- 
gles because the tool solves the games that happen between each of the doors 
before reaching the controlled one. 


6.3 Mona Lisa 


The game described in Sect. 2 between a thief and a security guard is very well
suited to further assess the strengths and limitations of both our approach and
SIMSYNTH. We ran several experiments with this scenario, scaling the size
of the room and the sleep time of the guard, as well as trying a scenario where 
the guard does not sleep at all. Scaling the size of the room makes it harder 
for SIMSYNTH to solve this game with a forward unrolling approach, while our 
approach extracts the necessary subgoals irrespective of the room size. However, 
scaling the guard’s sleep time makes it harder to solve the subgame between 
the two necessary subgoals, while it only has a minor effect on the length of the 
unrolling needed to stabilize the play in a safe region, as done by SIMSYNTH. 

The results in Fig. 3 support this conjecture. The size of the room has almost
no effect on either the runtime of CABPY or the number of considered
subgames. However, as the results for a sleep value of 4 show, the employed
combination of quantifier elimination and interpolation introduces some instability
in the produced formulas. This means we may get different Craig interpolants
and slice the game with more or fewer subgoals. Therefore, we see a lot of potential
in optimizing the interplay between the employed tools for quantifier elimina- 
tion and interpolation. The phenomenon of the runtime being sensitive to these 
small changes in values is also seen with SIMSYNTH, where a longer sleep time 
sometimes means a faster execution. 


Size    | Sleep | CABPY Subgames | CABPY Time (s) | SIMSYNTH Time (s) | Winner
10 x 10 | -     | 7              | 0.61           | 4.79              | SAFE
20 x 20 | -     | 7              | 0.60           | 25.26             | SAFE
40 x 40 | -     | 7              | 0.61           | 157.62            | SAFE
10 x 10 | 1     | 10             | 4.22           | 20.31             | SAFE
20 x 20 | 1     | 11             | 4.34           | 36.44             | SAFE
40 x 40 | 1     | 11             | 4.65           | 226.14            | SAFE
10 x 10 | 2     | 13             | 5.88           | 7.40              | SAFE
20 x 20 | 2     | 14             | 5.98           | 60.00             | SAFE
40 x 40 | 2     | 13             | 5.92           | 270.48            | SAFE
10 x 10 | 3     | 18             | 26.58          | 13.94             | SAFE
20 x 20 | 3     | 17             | 26.19          | 115.53            | SAFE
40 x 40 | 3     | 18             | 27.85          | 290.12            | SAFE
10 x 10 | 4     | 30             | 175.27         | 13.96             | SAFE
20 x 20 | 4     | 22             | 204.79         | 60.08             | SAFE
40 x 40 | 4     | 27             | 123.95         | 319.47            | SAFE

Fig. 3. Experimental results for the Mona Lisa game.


6.4 Program Synthesis


Lastly, we study two benchmarks that are directly related to program synthesis.
The first problem is to synthesize a controller for a thermostat by filling out an
incomplete program, as described in [4]. A range of possible initial values of the
room temperature c is given, e.g., $20.8 \le c \le 23.5$, together with the temperature
dynamics which depend on whether the heater is on (variable $o \in \mathbb{B}$). The
objective for SAFE is to control the value of o in every round such that c stays
between 20 and 25. This is a common benchmark for program synthesis tools
and both CABPY and SIMSYNTH solve it quickly (see Fig. 4). The other problem
relates to Lamport's bakery algorithm [26]. We consider two processes using this
protocol to ensure mutually exclusive access to a shared resource. The game
describes the task of synthesizing a scheduler that violates the mutual exclusion.
This essentially is a model checking problem, and we study it to see how well
the tools can infer a safety invariant that is out of control of the safety player.
For our approach, this makes no difference, as both players may play through a
subgoal and the framework is well suited to find a safety invariant. The forward
unrolling approach of SIMSYNTH, however, seems to explore the whole state
space before inferring safety, and fails to find an invariant before a timeout.


Name       | CABPY Subgames | CABPY Time (s) | SIMSYNTH Time (s) | Winner
Thermostat | 6              | 0.44           | 0.39              | SAFE
Bakery     | 46             | 18.25          | Timeout           | SAFE

Fig. 4. Experimental results for program synthesis problems.


7 Conclusion 


Our work is a step towards the fully automated synthesis of software. It tar- 
gets symbolically represented reachability games which are expressive enough 
to model a variety of problems, from common game benchmarks to program 
synthesis problems. The presented approach exploits causal information in the 
form of subgoals, which are parts of the game that the reachability player needs 
to pass through in order to win. Having computed a subgoal, which can be done 
using Craig interpolation, the game is split along the subgoal and solved recur- 
sively. At the same time, the algorithm infers a structured symbolic strategy 
for the winning player. The evaluation of our prototype implementation CABPY 
shows that our approach is practically applicable and scales much better than
previously available tools on several benchmarks. While termination is only guar- 
anteed for games with finite bisimulation quotient, the experiments demonstrate 
that several infinite games can be solved as well. 

This work opens up several interesting questions for further research. One 
concerns the quality of the returned strategies. Due to its compositional nature, 
at first sight it seems that our approach is not well-suited to handle global 
optimization criteria, such as reaching the goal in fewest possible steps. On the 
other hand, the returned strategies often involve only a few key decisions and 
we believe that therefore the strategies are often very sparse, although this has 
to be further investigated. We also plan to automatically extract deterministic 
strategies from the symbolic ones [5,17] we currently consider. 

Another question regards the computation of subgoals. The performance of 
our algorithm is highly influenced by which interpolant is returned by the solver. 
In particular this affects the number of subgames that have to be solved, and 
how complex they are. We believe that template-based interpolation [27] is a 
promising candidate to explore for computing good interpolants. This could be 
combined with the possibility for the user to provide templates or expressive 
interpolants directly, thereby benefiting from the user’s domain knowledge. 


References 


1. Alur, R., Moarref, S., Topcu, U.: Pattern-based refinement of assume-guarantee
specifications in reactive synthesis. In: Baier, C., Tinelli, C. (eds.) TACAS 2015.
LNCS, vol. 9035, pp. 501-516. Springer, Heidelberg (2015).
https://doi.org/10.1007/978-3-662-46681-0_49

2. Alur, R., Moarref, S., Topcu, U.: Compositional synthesis of reactive controllers
for multi-agent systems. In: Chaudhuri, S., Farzan, A. (eds.) CAV 2016. LNCS,
vol. 9780, pp. 251-269. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-41540-6_14

3. Baier, C., Coenen, N., Finkbeiner, B., Funke, F., Jantsch, S., Siber, J.: Causality- 
based game solving. CoRR (2021). https://arxiv.org/abs/2105.14247, long version 
with appendix 

4. Beyene, T., Chaudhuri, S., Popeea, C., Rybalchenko, A.: A constraint-based app- 
roach to solving games on infinite graphs. In: Principles of Programming Languages 
(POPL). ACM, New York (2014). https://doi.org/10.1145/2535838.2535860 

5. Bloem, R., Egly, U., Klampfl, P., Könighofer, R., Lonsing, F.: SAT-based methods 
for circuit synthesis. In: Formal Methods in Computer-Aided Design (FMCAD). 
IEEE (2014). https://doi.org/10.1109/FMCAD.2014.6987592 

6. Bloem, R., Galler, S., Jobstmann, B., Piterman, N., Pnueli, A., Weiglhofer, M.: 
Specify, compile, run: hardware from PSL. Electron. Notes Theor. Comput. Sci. 
190(4), 3-16 (2007). https://doi.org/10.1016/j.entcs.2007.09.004 

7. Bloem, R., Jobstmann, B., Piterman, N., Pnueli, A., Sa’ar, Y.: Synthesis of reac- 
tive(1) designs. J. Comput. Syst. Sci. 78(3), 911-938 (2012). https://doi.org/10. 
1016/j.jcss.2011.08.007. In Commemoration of Amir Pnueli 

8. Bouton, C.L.: Nim, a game with a complete mathematical theory. Ann. Math. 
3(1/4), 35-39 (1901). https://doi.org/10.2307/1967631 


9. Bradfield, J.C., Stirling, C.: Modal mu-calculi. In: Blackburn, P., van Benthem, 
J.F.A.K., Wolter, F. (eds.) Handbook of Modal Logic, Studies in Logic and 
Practical Reasoning, vol. 3, pp. 721-756. North-Holland (2007). 
https://doi.org/10.1016/s1570-2464(07)80015-2 

10. Brückner, I., Dräger, K., Finkbeiner, B., Wehrheim, H.: Slicing abstractions. In: 
Arbab, F., Sirjani, M. (eds.) FSEN 2007. LNCS, vol. 4767, pp. 17-32. Springer, 
Heidelberg (2007). https://doi.org/10.1007/978-3-540-75698-9_2 

11. Chen, Y., Tůmová, J., Belta, C.: LTL Robot Motion Control based on Automata 
Learning of Environmental Dynamics. In: International Conference on Robotics 
and Automation. IEEE (2012). https://doi.org/10.1109/ICRA.2012.6225075 

12. Cimatti, A., Griggio, A., Schaafsma, B.J., Sebastiani, R.: The MathSAT5 SMT 
solver. In: Piterman, N., Smolka, S.A. (eds.) TACAS 2013. LNCS, vol. 7795, pp. 
93-107. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36742-7_7 

13. Craig, W.: Three uses of the Herbrand-Gentzen theorem in relating model theory 
and proof theory. J. Symbolic Logic 22(3), 269-285 (1957). 
https://doi.org/10.2307/2963594 

14. de Alfaro, L., Henzinger, T.A., Majumdar, R.: Symbolic algorithms for 
infinite-state games. In: Larsen, K.G., Nielsen, M. (eds.) CONCUR 2001. LNCS, 
vol. 2154, pp. 536-550. Springer, Heidelberg (2001). 
https://doi.org/10.1007/3-540-44685-0_36 

15. Edelkamp, S.: Symbolic exploration in two-player games: preliminary results. In: 
The International Conference on AI Planning & Scheduling (AIPS), Workshop on 
Model Checking (2002) 

16. Eén, N., Legg, A., Narodytska, N., Ryzhyk, L.: SAT-based strategy extraction 
in reachability games. In: Conference on Artificial Intelligence (AAAI) (2015). 
https://ojs.aaai.org/index.php/AAAI/article/view/9752 

17. Ehlers, R., Moldovan, D.: Sparse positional strategies for safety games. In: 
Workshop on Synthesis (SYNT), EPTCS (2012). https://doi.org/10.4204/EPTCS.84.1 

18. Farzan, A., Kincaid, Z.: Strategy synthesis for linear arithmetic games. Proc. ACM 
Program. Lang. 2(POPL) (2017). https://doi.org/10.1145/3158149 

19. Gario, M., Micheli, A.: PySMT: a solver-agnostic library for fast prototyping of 
SMT-based algorithms. In: SMT Workshop 2015 (2015) 

20. Grädel, E., Thomas, W., Wilke, T. (eds.): Automata Logics, and Infinite Games. 
LNCS, vol. 2500. Springer, Heidelberg (2002). 
https://doi.org/10.1007/3-540-36387-4 

21. Harding, A., Ryan, M., Schobbens, P.-Y.: A new algorithm for strategy synthesis 
in LTL games. In: Halbwachs, N., Zuck, L.D. (eds.) TACAS 2005. LNCS, vol. 
3440, pp. 477-492. Springer, Heidelberg (2005). 
https://doi.org/10.1007/978-3-540-31980-1_31 

22. Hoffmann, J., Porteous, J., Sebastia, L.: Ordered landmarks in planning. J. Artif. 
Intell. Res. 22(1), 215-278 (2004) 

23. Jessen, J.J., Rasmussen, J.I., Larsen, K.G., David, A.: Guided controller synthesis 
for climate controller using UPPAAL TIGA. In: Raskin, J.-F., Thiagarajan, P.S. 
(eds.) FORMATS 2007. LNCS, vol. 4763, pp. 227-240. Springer, Heidelberg (2007). 
https://doi.org/10.1007/978-3-540-75454-1_17 

24. Kupriyanov, A., Finkbeiner, B.: Causality-based verification of multi-threaded 
programs. Concur. Theory (CONCUR) (2013). 
https://doi.org/10.1007/978-3-642-40184-8_19 

25. Kupriyanov, A., Finkbeiner, B.: Causal termination of multi-threaded programs. 
Comput. Aided Verification (CAV) (2014). 
https://doi.org/10.1007/978-3-319-08867-9_54 


26. Lamport, L.: A new solution of Dijkstra’s concurrent programming problem. 
Commun. ACM 17(8), 453-455 (1974). https://doi.org/10.1145/361082.361093 

27. Leroux, J., Rümmer, P., Subotić, P.: Guiding Craig interpolation with 
domain-specific abstractions. Acta Informatica 53(4), 387-424 (2016). 
https://doi.org/10.1007/s00236-015-0236-z 

28. Menzies, P., Beebee, H.: Counterfactual theories of causation. In: Zalta, E.N. 
(ed.) The Stanford Encyclopedia of Philosophy. Stanford University, Metaphysics 
Research Lab (2020) 

29. Monniaux, D.: A quantifier elimination algorithm for linear real arithmetic. In: 
Cervesato, I., Veith, H., Voronkov, A. (eds.) LPAR 2008. LNCS (LNAI), vol. 
5330, pp. 243-257. Springer, Heidelberg (2008). 
https://doi.org/10.1007/978-3-540-89439-1_18 

30. de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Ramakrishnan, C.R., 
Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337-340. Springer, Heidelberg 
(2008). https://doi.org/10.1007/978-3-540-78800-3_24 

31. Narodytska, N., Legg, A., Bacchus, F., Ryzhyk, L., Walker, A.: Solving games 
without controllable predecessor. In: Biere, A., Bloem, R. (eds.) CAV 2014. LNCS, 
vol. 8559, pp. 533-540. Springer, Cham (2014). 
https://doi.org/10.1007/978-3-319-08867-9_35 

32. Pozanco, A., E-Martin, Y., Fernández, S., Borrajo, D.: Counterplanning using goal 
recognition and landmarks. In: International Joint Conference on Artificial 
Intelligence (IJCAI) (2018). https://doi.org/10.24963/ijcai.2018/668 

33. Presburger, M.: Über die Vollständigkeit eines gewissen Systems der Arithmetik 
ganzer Zahlen, in welchem die Addition als einzige Operation hervortritt. Comptes 
Rendus du I congrès de Mathématiciens des Pays Slaves (1929) 

34. Ryzhyk, L., Chubb, P., Kuz, I., Le Sueur, E., Heiser, G.: Automatic device driver 
synthesis with termite. In: Symposium on Operating Systems Principles (SOSP). 
Association for Computing Machinery (ACM) (2009). 
https://doi.org/10.1145/1629575.1629583 

35. Siber, J.: The Virtual Machine containing CabPy (2021). 
https://doi.org/10.6084/m9.figshare.14493804.v3 

36. Sreedharan, S., Srivastava, S., Smith, D.E., Kambhampati, S.: Why can’t you do 
that HAL? Explaining unsolvability of planning tasks. In: International Joint 
Conference on Artificial Intelligence (IJCAI) (2019). 
https://doi.org/10.24963/ijcai.2019/197 

37. Thomas, W.: On the synthesis of strategies in infinite games. In: Mayr, E.W., 
Puech, C. (eds.) STACS 1995. LNCS, vol. 900, pp. 1-13. Springer, Heidelberg (1995). 
https://doi.org/10.1007/3-540-59042-0_57 

38. Walker, A., Ryzhyk, L.: Predicate abstraction for reactive synthesis. Formal Meth. 
Comput. Aided Des. (FMCAD) (2014). https://doi.org/10.1109/FMCAD.2014.6987617 

39. Zappe, J.: Modal μ-calculus and alternating tree automata. In: Grädel, E., Thomas, 
W., Wilke, T. (eds.) Automata Logics, and Infinite Games. LNCS, vol. 2500, pp. 
171-184. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36387-4_10 




Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the 
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the 
material. If material is not included in the chapter’s Creative Commons license and 
your intended use is not permitted by statutory regulation or exceeds the permitted 
use, you will need to obtain permission directly from the copyright holder. 


